Mixed-Signal Artificial Neural Network Accelerator

ABSTRACT

An artificial neural network (ANN) accelerator is provided. The ANN accelerator includes digitally-controlled oscillators (DCOs), digital-to-time converters (DTCs) and a mixed-signal multiply-and-accumulate (MAC) array. Each DCO generates a first analog operand signal based on a first digital data value, and transmits the first analog operand signal along a respective column signal line. Each DTC generates a second analog operand signal based on a second digital data value, and transmits the second analog operand signal along a respective row signal line. The mixed-signal MAC array is coupled to the row and column signal lines, and includes mixed-signal MAC units. Each mixed-signal MAC unit includes an integrated clock gate (ICG) that generates a digital product signal based on the first and second analog operand signals, and a counter circuit that increments or decrements a count value stored in a register based on the digital product signal.

BACKGROUND

The present disclosure relates to computer systems. More particularly, the present disclosure relates to computer systems including artificial neural networks (ANNs).

ANNs, such as deep neural networks (DNNs), convolutional neural networks (CNNs), etc., are a popular solution to a wide array of challenging classification, recognition and regression problems. However, many ANNs require a large number of calculations involving a large number of filter weights and activations, which presents a significant challenge with respect to access, storage and performance, particularly for mobile and other power or storage-constrained devices. An ANN hardware accelerator accelerates these calculations, such as, for example, general matrix multiplication (GEMM) operations performed by DNNs, convolution operations performed by CNNs, etc.

CNNs typically do not perform native convolution operations due to the complicated dataflow and expensive datapaths that are required. Instead, native convolution operations are converted into GEMM operations, which are then executed more efficiently by a central processing unit (CPU), a specialized processor, an ANN accelerator that includes systolic multiply-and-accumulate (MAC) arrays, a non-volatile memory (NVM) accelerator for neural networks, etc. An NVM accelerator includes both digital processing circuitry and one or more analog NVM crossbar arrays that perform GEMM operations, such as matrix multiply-accumulate (MAC) operations. For example, the filter weights and activations (i.e., input feature maps or IFMs) for a convolution layer of a CNN may be converted into an expanded format (e.g., IM2COL format), and then processed as GEMM operations by an ANN accelerator to generate output feature maps (OFMs). An activation or scaling function and a bias may be applied to the OFMs by the convolution layer or a separate activation layer, and then the OFMs are provided as the activations (i.e., IFMs) for the next layer of the CNN.

In an ANN accelerator with digital MAC arrays, one of the biggest hardware costs, in terms of power and area, is the flip-flops required to pipeline the movement of operands over the typically large array. Other closely-related architectures share broadly similar costs for moving operands around a large silicon area. In an NVM accelerator with analog NVM crossbar arrays, undesirable noise and linearity issues may significantly reduce the accuracy of the ANN. NVM accelerators also suffer from the prohibitive costs of analog-to-digital converter (ADC) and digital-to-analog converter (DAC) circuits.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an ANN, in accordance with an embodiment of the present disclosure.

FIG. 2A depicts a CNN, in accordance with an embodiment of the present disclosure.

FIG. 2B depicts a convolution layer operation within a convolutional layer of a CNN, in accordance with an embodiment of the present disclosure.

FIG. 2C depicts a converted convolutional operation within a convolutional layer of a CNN, in accordance with an embodiment of the present disclosure.

FIG. 3A depicts a data flow diagram for a digital MAC array.

FIG. 3B depicts a data flow diagram for an analog NVM crossbar array.

FIG. 4 depicts a data flow diagram for a mixed-signal MAC array, in accordance with an embodiment of the present disclosure.

FIG. 5 depicts a block diagram of a mixed-signal MAC unit, in accordance with an embodiment of the present disclosure.

FIG. 6 depicts a block diagram of a system, in accordance with an embodiment of the present disclosure.

FIG. 7 depicts a block diagram of an ANN accelerator, in accordance with an embodiment of the present disclosure.

FIG. 8 depicts a flow diagram representing functionality associated with performing a mixed-signal MAC operation, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION

Embodiments of the present disclosure will now be described with reference to the drawing figures, in which like reference numerals refer to like parts throughout.

Embodiments of the present disclosure advantageously improve the distribution of operands in ANN accelerators by eliminating the need for flip-flops, ADCs and DACs. More particularly, digital operands are encoded as analog signals using digitally-controlled oscillators (DCOs) and digital-to-time converters (DTCs), and then distributed for processing by an array of mixed-signal (i.e., analog and digital) MAC units. Advantageously, analog operand distribution reduces hardware cost because a single wire can encode a continuous value which could represent as much data as, for example, an 8-wire digital signal, while the digital value that is output by each element of the mixed-signal MAC array avoids expensive ADCs.

In one embodiment, a mixed-signal multiply-and-accumulate (MAC) unit includes an integrated clock gate (ICG) and a counter circuit. The ICG includes a clock port coupled to a first analog operand signal line, an enable port coupled to a second analog operand signal line, and an output port. The ICG is configured to receive, at the clock port, a first analog operand signal, receive, at the enable port, a second analog operand signal, generate a digital product signal based on the first analog operand signal and the second analog operand signal, and output, from the output port, the digital product signal. The counter circuit includes an input port coupled to the ICG output port, a register, an adder/subtractor circuit and an output port. The counter circuit is configured to receive, at the input port, the digital product signal, increment or decrement a count value stored in the register based on the digital product signal, and output, from the output port, the count value.
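As a purely behavioral illustration of the ICG and counter circuit described above, the following sketch assumes that the first analog operand signal behaves as a pulse train delivering a fixed number of clock edges per unit interval, and that the second analog operand signal behaves as an enable window lasting a whole number of unit intervals. Those encodings, the function names and the `negative` flag are assumptions made for illustration and are not taken from the disclosure; the sketch does not model any circuit timing.

```python
def gated_edge_count(edges_per_interval, enable_intervals):
    """Count clock edges that pass the gate while the enable input is high.

    Assumption: the clock input delivers `edges_per_interval` edges in each
    unit interval, and the enable input stays high for `enable_intervals`
    unit intervals.
    """
    count = 0
    for _ in range(enable_intervals):        # enable window is open
        for _ in range(edges_per_interval):  # each edge passes the gate
            count += 1
    return count


def mac_unit(operand_pairs):
    """Accumulate gated edge counts in a register, adding or subtracting."""
    register = 0
    for first, second, negative in operand_pairs:
        edges = gated_edge_count(first, second)
        # The counter decrements for a negative product, increments otherwise.
        register = register - edges if negative else register + edges
    return register


# Three hypothetical products: 3*4 - 2*5 + 6*1
count = mac_unit([(3, 4, False), (2, 5, True), (6, 1, False)])
```

Under these assumptions, the number of edges that pass the gate equals the product of the two operand values, so the register holds a running multiply-and-accumulate result.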

An ANN models the relationships between input data or signals and output data or signals using a network of interconnected nodes that is trained through a learning process. The nodes are arranged into various layers, including, for example, an input layer, one or more hidden layers, and an output layer. The input layer receives input data, such as, for example, image data, and the output layer generates output data, such as, for example, a probability that the image data contains a known object. Each hidden layer provides at least a partial transformation of the input data to the output data. A DNN has multiple hidden layers in order to model complex, nonlinear relationships between input data and output data.

In a fully-connected, feedforward ANN, each node is connected to all of the nodes in the preceding layer, as well as to all of the nodes in the subsequent layer. For example, each input layer node is connected to each hidden layer node, each hidden layer node is connected to each input layer node and each output layer node, and each output layer node is connected to each hidden layer node. Additional hidden layers are similarly interconnected. Each connection has a weight value, and each node has an activation function, such as, for example, a linear function, a step function, a sigmoid function, a tanh function, a rectified linear unit (ReLU) function, etc., that determines the output of the node based on the weighted sum of the inputs to the node. The input data propagates from the input layer nodes, through respective connection weights to the hidden layer nodes, and then through respective connection weights to the output layer nodes.

More particularly, at each input node, input data is provided to the activation function for that node, and the output of the activation function is then provided as an input data value to each hidden layer node. At each hidden layer node, the input data value received from each input layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation value that is provided to the activation function for that node. The output of the activation function is then provided as an input data value to each output layer node. At each output layer node, the output data value received from each hidden layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation value that is provided to the activation function for that node. The output of the activation function is then provided as output data. Additional hidden layers may be similarly configured to process data.
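The weighted-sum-then-activation step described above may be sketched as follows. This is a minimal illustration only: the function names, the layer sizes and the numeric weights are hypothetical, and ReLU is chosen merely as one of the activation functions listed earlier.

```python
def relu(x):
    """Rectified linear unit activation."""
    return max(0.0, x)


def dense_layer(inputs, weights, activation):
    """Compute one fully-connected layer.

    weights[j][i] is the connection weight from input i to node j.
    """
    outputs = []
    for node_weights in weights:
        acc = 0.0
        for x, w in zip(inputs, node_weights):
            acc += x * w  # one multiply-and-accumulate (MAC) operation
        outputs.append(activation(acc))
    return outputs


# Hypothetical network: 3 inputs -> 2 hidden nodes -> 1 output node
hidden = dense_layer([1.0, 2.0, 3.0],
                     [[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]],
                     relu)
output = dense_layer(hidden, [[0.5, 0.5]], relu)
```

Each pass through the inner loop is a single MAC operation, which is the operation that an ANN accelerator is intended to speed up.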

A multi-layer perceptron (MLP) is an ANN that has an input layer, an output layer and one or more hidden layers. MLPs may be used for natural language processing applications, such as machine translation, speech recognition, etc. Other ANNs include recurrent neural networks (RNNs), long short-term memories (LSTMs), sequence-to-sequence models that include an encoder RNN and a decoder RNN, shallow neural networks, etc.

A CNN is a variation of an MLP that may be used for classification or recognition applications, such as image recognition, speech recognition, etc. A CNN has an input layer, an output layer and multiple hidden layers including convolutional layers, pooling layers, normalization layers, fully-connected layers, etc. Each convolutional layer applies a sliding dot product or cross-correlation to an input volume, applies an activation function to the results, and then provides the activation or output volume to the next layer. Convolutional layers typically use the ReLU function as the activation function. In certain embodiments, the activation function is provided in a separate activation layer, such as, for example, a ReLU layer. A pooling layer reduces the dimensions of the output volume received from the preceding convolutional layer, and may calculate an average or a maximum over small clusters of data, such as, for example, 2×2 matrices. In certain embodiments, a convolutional layer and a pooling layer may form a single layer of a CNN. The fully-connected layers follow the convolutional and pooling layers, and include a flatten layer and a classification layer, followed by a normalization layer that includes a normalization function, such as the SoftMax function. The output layer follows the last fully-connected layer; in certain embodiments, the output layer may include the normalization function.
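The pooling step mentioned above may be illustrated with a short sketch. It assumes non-overlapping 2×2 clusters (i.e., a stride of 2) and a single-channel volume with even dimensions; those choices, and the names used, are assumptions for illustration only.

```python
def pool2x2(volume, reduce_fn=max):
    """Reduce each non-overlapping 2x2 cluster to one value."""
    rows, cols = len(volume), len(volume[0])
    return [
        [
            reduce_fn([volume[r][c], volume[r][c + 1],
                       volume[r + 1][c], volume[r + 1][c + 1]])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
max_pooled = pool2x2(feature_map)        # maximum over each cluster
avg_pooled = pool2x2(feature_map, mean)  # average over each cluster
```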

FIG. 1 depicts ANN 10, in accordance with an embodiment of the present disclosure.

ANN 10 includes input layer 20, one or more hidden layers 30, 40, 50, etc., and output layer 60. Input layer 20 includes one or more input nodes 21, 22, 23, etc. Hidden layer 30 includes one or more hidden nodes 31, 32, 33, 34, 35, etc. Hidden layer 40 includes one or more hidden nodes 41, 42, 43, 44, 45, etc. Hidden layer 50 includes one or more hidden nodes 51, 52, 53, 54, 55, etc. Output layer 60 includes one or more output nodes 61, 62, etc. Generally, ANN 10 includes N hidden layers, input layer 20 includes “i” nodes, hidden layer 30 includes “j” nodes, hidden layer 40 includes “k” nodes, hidden layer 50 includes “m” nodes, and output layer 60 includes “o” nodes. Many variations of input, hidden and output layers are clearly possible, including hidden layers that are locally-connected, rather than fully-connected, to one another.

Training an ANN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the ANN achieves a particular level of accuracy. One method is backpropagation, or backward propagation of errors, which iteratively and recursively determines a gradient descent with respect to the connection weights, and then adjusts the connection weights to improve the performance of the network.

FIG. 2A depicts CNN 15, in accordance with an embodiment of the present disclosure. CNN 15 includes input layer 20, one or more hidden layers, such as convolutional layer 30-1, pooling layer 30-2, hidden (flatten) layer 40, hidden (classification) layer 50, etc., and output layer 60. Many other variations of input, hidden and output layers are contemplated.

Input layer 20 includes one or more input nodes 21, etc., that present the input data, such as a color image, as an input volume to the first convolutional layer, e.g., convolutional layer 30-1. The input volume is a three-dimensional matrix that has a width, a height and a depth. For example, input data that represent a color image may be presented as an input volume that is 512 pixels×512 pixels×3 channels (red, green, blue); other input volume dimensions may also be used, such as 32×32×3, 64×64×3, 128×128×3, etc., 32×32×1, 64×64×1, 128×128×1, 512×512×1, etc.

Convolutional layer 30-1 is locally-connected to input layer 20, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). For a CNN that uses a standard convolution, each node computes a dot product between the node's weights and the respective local region of the input volume to generate one element of an output volume. An activation function and a bias may be applied to each element of the output volume, and the output volume is then provided as the input volume to the next layer. The activation function and bias may be applied by each convolutional layer node or by the nodes of a subsequent locally-connected layer, such as an ReLU layer.

Pooling layer 30-2 is locally-connected to convolutional layer 30-1, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). Pooling layer 30-2 also produces an output volume that is provided as the input volume to the subsequent layer, such as, for example, another convolutional layer 30-1, a flatten layer 40, etc. In certain embodiments, convolutional layer 30-1 and pooling layer 30-2 form a single hidden layer 30. Similarly, in certain embodiments, convolutional layer 30-1, a ReLU layer and pooling layer 30-2 form a single hidden layer 30. Generally, the output volumes of the convolutional and pooling layers may be described as output feature maps, and one or more single hidden layers 30 form a feature learning portion of CNN 15.

Hidden layer 40 is a “flatten” layer that is locally-connected to pooling layer 30-2, and includes one or more hidden (flatten) nodes 41, 42, 43, 44, 45, etc. Hidden (flatten) layer 40 “flattens” the output volume produced by the preceding pooling layer 30-2 into a column vector, which is provided to the subsequent, fully-connected hidden layer 50.

Hidden layer 50 is a classification layer that is fully-connected to hidden (flatten) layer 40, and includes one or more hidden (classification) nodes 51, 52, 53, 54, 55, etc.

Output layer 60 includes one or more output nodes 61, 62, etc., and is fully-connected to hidden (classification) layer 50. Fully-connected output layer 60 receives the classification results output by hidden (classification) layer 50, and each node outputs a predicted class score. A normalization function, such as a Softmax function, may be applied to the predicted class scores by output layer 60, or, alternatively, by an additional layer interposed between hidden (classification) layer 50 and output layer 60.

Similar to ANNs, training a CNN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the CNN achieves a particular level of accuracy. As noted above, backpropagation may be used to iteratively and recursively determine a gradient descent with respect to the connection weights, and then adjust the connection weights to improve the performance of the network. Matrix multiplication operations, and, more particularly, MAC operations, are used extensively by CNNs, as well as other ANNs.

FIG. 2B depicts convolution layer operation 200 within convolutional layer 30-1 of CNN 15, in accordance with an embodiment of the present disclosure.

A convolutional layer generally includes M filters, C input channels, C input feature maps (i.e., one input feature map for each input channel) and M output feature maps (i.e., one output feature map for each filter). Each filter has C weight sets (i.e., each filter has a weight set for each input channel), and is convolved across the input feature maps to produce an output feature map corresponding to that filter. Convolutional layers generally require the movement of large amounts of data, generate a significant computational load, and require buffers of considerable size to store intermediate values.

In this embodiment, convolutional layer 30-1 includes four weight matrices or filters 202, i.e., filter 202 ¹, 202 ², 202 ³ and 202 ⁴, one input channel, one input feature map 204 and four output feature maps 206, i.e., 206 ¹, 206 ², 206 ³ and 206 ⁴. Each filter 202 is convolved across input feature map 204 to produce an output feature map 206 corresponding to that filter, i.e., output feature map 206 ¹ corresponds to filter 202 ¹, output feature map 206 ² corresponds to filter 202 ², output feature map 206 ³ corresponds to filter 202 ³, and output feature map 206 ⁴ corresponds to filter 202 ⁴. For illustration purposes, each filter 202 ¹, 202 ², 202 ³ and 202 ⁴ is a 2×2×1 weight matrix, input feature map 204 is a 5×5×1 input data matrix, and each output feature map 206 ¹, 206 ², 206 ³ and 206 ⁴ is a 4×4 output data matrix. In this embodiment, with a stride of 1 and no padding, the total number of MAC operations performed by convolution layer operation 200 is (2×2×1)×(4×4)×4 or 256.
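The MAC count stated above can be checked with a small sketch of a stride-1, no-padding convolution. The input values 1 through 25 stand in for a₁ through a₂₅, and the four filters are arbitrary placeholder weights; both are assumptions for illustration and do not come from the disclosure.

```python
def conv2d_valid(ifm, filt):
    """Stride-1, no-padding sliding dot product.

    Returns the output feature map and the number of MAC operations used.
    """
    fh, fw = len(filt), len(filt[0])
    oh, ow = len(ifm) - fh + 1, len(ifm[0]) - fw + 1
    macs = 0
    ofm = [[0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            for i in range(fh):
                for j in range(fw):
                    ofm[r][c] += filt[i][j] * ifm[r + i][c + j]
                    macs += 1
    return ofm, macs


# 5x5 input feature map holding 1..25 in row-major order (placeholder data)
ifm = [[5 * r + c + 1 for c in range(5)] for r in range(5)]
# Four hypothetical 2x2 filters
filters = [[[1, 0], [0, 0]],
           [[0, 1], [0, 0]],
           [[0, 0], [1, 0]],
           [[1, 1], [1, 1]]]
results = [conv2d_valid(ifm, f) for f in filters]
total_macs = sum(macs for _, macs in results)
```

Each filter produces a 4×4 output feature map using 64 MAC operations, so the four filters together use 256.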

For ease of discussion, input feature map 204 may be divided into four overlapping portions or quadrants. The first quadrant (i.e., a_(q1)) includes the first and second rows, i.e., a₁, a₂, a₃, a₄, a₅ and a₆, a₇, a₈, a₉, a₁₀, the second quadrant includes the second and third rows, i.e., a₆, a₇, a₈, a₉, a₁₀ and a₁₁, a₁₂, a₁₃, a₁₄, a₁₅, the third quadrant includes the third and fourth rows, i.e., a₁₁, a₁₂, a₁₃, a₁₄, a₁₅ and a₁₆, a₁₇, a₁₈, a₁₉, a₂₀, and the fourth quadrant includes the fourth and fifth rows, i.e., a₁₆, a₁₇, a₁₈, a₁₉, a₂₀ and a₂₁, a₂₂, a₂₃, a₂₄, a₂₅.

Similarly, output feature maps 206 may be divided into four three-dimensional portions or quadrants. The first quadrant (i.e., o_(q1)) includes the first row of each output feature map 206 ¹, 206 ², 206 ³ and 206 ⁴, i.e., o¹ ₁, o¹ ₂, o¹ ₃, o¹ ₄, o² ₁, o² ₂, o² ₃, o² ₄, o³ ₁, o³ ₂, o³ ₃, o³ ₄, o⁴ ₁, o⁴ ₂, o⁴ ₃ and o⁴ ₄. The second quadrant (not shown for clarity) includes the second row of each output feature map 206 ¹, 206 ², 206 ³ and 206 ⁴, i.e., o¹ ₅, o¹ ₆, o¹ ₇, o¹ ₈, o² ₅, o² ₆, o² ₇, o² ₈, o³ ₅, o³ ₆, o³ ₇, o³ ₈, o⁴ ₅, o⁴ ₆, o⁴ ₇ and o⁴ ₈. The third quadrant (not shown for clarity) includes the third row of each output feature map 206 ¹, 206 ², 206 ³ and 206 ⁴, i.e., o¹ ₉, o¹ ₁₀, o¹ ₁₁, o¹ ₁₂, o² ₉, o² ₁₀, o² ₁₁, o² ₁₂, o³ ₉, o³ ₁₀, o³ ₁₁, o³ ₁₂, o⁴ ₉, o⁴ ₁₀, o⁴ ₁₁, and o⁴ ₁₂. The fourth quadrant (not shown for clarity) includes the fourth row of each output feature map 206 ¹, 206 ², 206 ³ and 206 ⁴, i.e., o¹ ₁₃, o¹ ₁₄, o¹ ₁₅, o¹ ₁₆, o² ₁₃, o² ₁₄, o² ₁₅, o² ₁₆, o³ ₁₃, o³ ₁₄, o³ ₁₅, o³ ₁₆, o⁴ ₁₃, o⁴ ₁₄, o⁴ ₁₅, and o⁴ ₁₆. All of the elements from quadrants o_(q1), o_(q2), o_(q3) and o_(q4) are depicted in FIG. 2C.

The convolution operations performed on the first quadrant (i.e., a_(q1)) of input feature map 204 are now discussed in detail.

For output feature map 206 ¹, element o¹ ₁ is the dot product of filter 202 ¹ and the first block (i.e., a₁, a₂, a₆ and a₇) of the first quadrant a_(q1) of input feature map 204, element o¹ ₂ is the dot product of filter 202 ¹ and the second block (i.e., a₂, a₃, a₇ and a₈) of the first quadrant a_(q1) of input feature map 204, element o¹ ₃ is the dot product of filter 202 ¹ and the third block (i.e., a₃, a₄, a₈ and a₉) of the first quadrant a_(q1) of input feature map 204, and o¹ ₄ is the dot product of filter 202 ¹ and the fourth block (i.e., a₄, a₅, a₉ and a₁₀) of the first quadrant a_(q1) of input feature map 204.

More particularly, the dot product of filter 202 ¹ and the first block of the first quadrant a_(q1) is equal to w¹ ₁·a₁+w¹ ₂·a₂+w¹ ₃·a₆+w¹ ₄·a₇ (i.e., o¹ ₁). The dot product of filter 202 ¹ and the second block of the first quadrant a_(q1) is equal to w¹ ₁·a₂+w¹ ₂·a₃+w¹ ₃·a₇+w¹ ₄·a₈ (i.e., o¹ ₂). The dot product of filter 202 ¹ and the third block of the first quadrant a_(q1) is equal to w¹ ₁·a₃+w¹ ₂·a₄+w¹ ₃·a₈+w¹ ₄·a₉ (i.e., o¹ ₃). The dot product of filter 202 ¹ and the fourth block of the first quadrant a_(q1) is equal to w¹ ₁·a₄+w¹ ₂·a₅+w¹ ₃·a₉+w¹ ₄·a₁₀ (i.e., o¹ ₄).

For output feature map 206 ², element o² ₁ is the dot product of filter 202 ² and the first block (i.e., a₁, a₂, a₆ and a₇) of the first quadrant a_(q1) of input feature map 204, output feature map element o² ₂ is the dot product of filter 202 ² and the second block (i.e., a₂, a₃, a₇ and a₈) of the first quadrant a_(q1) of input feature map 204, output feature map element o² ₃ is the dot product of filter 202 ² and the third block (i.e., a₃, a₄, a₈ and a₉) of the first quadrant a_(q1) of input feature map 204, and output feature map element o² ₄ is the dot product of filter 202 ² and the fourth block (i.e., a₄, a₅, a₉ and a₁₀) of the first quadrant a_(q1) of input feature map 204.

More particularly, the dot product of filter 202 ² and the first block of the first quadrant a_(q1) is equal to w² ₁·a₁+w² ₂·a₂+w² ₃·a₆+w² ₄·a₇ (i.e., o² ₁). The dot product of filter 202 ² and the second block of the first quadrant a_(q1) is equal to w² ₁·a₂+w² ₂·a₃+w² ₃·a₇+w² ₄·a₈ (i.e., o² ₂). The dot product of filter 202 ² and the third block of the first quadrant a_(q1) is equal to w² ₁·a₃+w² ₂·a₄+w² ₃·a₈+w² ₄·a₉ (i.e., o² ₃). The dot product of filter 202 ² and the fourth block of the first quadrant a_(q1) is equal to w² ₁·a₄+w² ₂·a₅+w² ₃·a₉+w² ₄·a₁₀ (i.e., o² ₄).

For output feature map 206 ³, element o³ ₁ is the dot product of filter 202 ³ and the first block (i.e., a₁, a₂, a₆ and a₇) of the first quadrant a_(q1) of input feature map 204, output feature map element o³ ₂ is the dot product of filter 202 ³ and the second block (i.e., a₂, a₃, a₇ and a₈) of the first quadrant a_(q1) of input feature map 204, output feature map element o³ ₃ is the dot product of filter 202 ³ and the third block (i.e., a₃, a₄, a₈ and a₉) of the first quadrant a_(q1) of input feature map 204, and output feature map element o³ ₄ is the dot product of filter 202 ³ and the fourth block (i.e., a₄, a₅, a₉ and a₁₀) of the first quadrant a_(q1) of input feature map 204.

More particularly, the dot product of filter 202 ³ and the first block of the first quadrant a_(q1) is equal to w³ ₁·a₁+w³ ₂·a₂+w³ ₃·a₆+w³ ₄·a₇ (i.e., o³ ₁). The dot product of filter 202 ³ and the second block of the first quadrant a_(q1) is equal to w³ ₁·a₂+w³ ₂·a₃+w³ ₃·a₇+w³ ₄·a₈ (i.e., o³ ₂). The dot product of filter 202 ³ and the third block of the first quadrant a_(q1) is equal to w³ ₁·a₃+w³ ₂·a₄+w³ ₃·a₈+w³ ₄·a₉ (i.e., o³ ₃). The dot product of filter 202 ³ and the fourth block of the first quadrant a_(q1) is equal to w³ ₁·a₄+w³ ₂·a₅+w³ ₃·a₉+w³ ₄·a₁₀ (i.e., o³ ₄).

For output feature map 206 ⁴, element o⁴ ₁ is the dot product of filter 202 ⁴ and the first block (i.e., a₁, a₂, a₆ and a₇) of the first quadrant a_(q1) of input feature map 204, output feature map element o⁴ ₂ is the dot product of filter 202 ⁴ and the second block (i.e., a₂, a₃, a₇ and a₈) of the first quadrant a_(q1) of input feature map 204, output feature map element o⁴ ₃ is the dot product of filter 202 ⁴ and the third block (i.e., a₃, a₄, a₈ and a₉) of the first quadrant a_(q1) of input feature map 204, and output feature map element o⁴ ₄ is the dot product of filter 202 ⁴ and the fourth block (i.e., a₄, a₅, a₉ and a₁₀) of the first quadrant a_(q1) of input feature map 204.

More particularly, the dot product of filter 202 ⁴ and the first block of the first quadrant a_(q1) is equal to w⁴ ₁·a₁+w⁴ ₂·a₂+w⁴ ₃·a₆+w⁴ ₄·a₇ (i.e., o⁴ ₁). The dot product of filter 202 ⁴ and the second block of the first quadrant a_(q1) is equal to w⁴ ₁·a₂+w⁴ ₂·a₃+w⁴ ₃·a₇+w⁴ ₄·a₈ (i.e., o⁴ ₂). The dot product of filter 202 ⁴ and the third block of the first quadrant a_(q1) is equal to w⁴ ₁·a₃+w⁴ ₂·a₄+w⁴ ₃·a₈+w⁴ ₄·a₉ (i.e., o⁴ ₃). The dot product of filter 202 ⁴ and the fourth block of the first quadrant a_(q1) is equal to w⁴ ₁·a₄+w⁴ ₂·a₅+w⁴ ₃·a₉+w⁴ ₄·a₁₀ (i.e., o⁴ ₄).

The convolution operations performed on the remaining three quadrants of input feature map 204 are done in the same manner. The second quadrant of input feature map 204 also includes four blocks, i.e., the first block includes a₆, a₇, a₁₁ and a₁₂, the second block includes a₇, a₈, a₁₂ and a₁₃, the third block includes a₈, a₉, a₁₃ and a₁₄, and the fourth block includes a₉, a₁₀, a₁₄ and a₁₅. The third quadrant of input feature map 204 also includes four blocks, i.e., the first block includes a₁₁, a₁₂, a₁₆ and a₁₇, the second block includes a₁₂, a₁₃, a₁₇ and a₁₈, the third block includes a₁₃, a₁₄, a₁₈ and a₁₉, and the fourth block includes a₁₄, a₁₅, a₁₉ and a₂₀. The fourth quadrant of input feature map 204 also includes four blocks, i.e., the first block includes a₁₆, a₁₇, a₂₁ and a₂₂, the second block includes a₁₇, a₁₈, a₂₂ and a₂₃, the third block includes a₁₈, a₁₉, a₂₃ and a₂₄, and the fourth block includes a₁₉, a₂₀, a₂₄ and a₂₅.

For the second quadrant of output feature map 206 ¹, element o¹ ₅ is the dot product of filter 202 ¹ and the first block of the second quadrant of input feature map 204, element o¹ ₆ is the dot product of filter 202 ¹ and the second block of the second quadrant of input feature map 204, element o¹ ₇ is the dot product of filter 202 ¹ and the third block of the second quadrant of input feature map 204, and element o¹ ₈ is the dot product of filter 202 ¹ and the fourth block of the second quadrant of input feature map 204. For the second quadrant of output feature map 206 ², elements o² ₅, o² ₆, o² ₇, and o² ₈ are calculated in the same manner using filter 202 ². For the second quadrant of output feature map 206 ³, elements o³ ₅, o³ ₆, o³ ₇, and o³ ₈ are calculated in the same manner using filter 202 ³. For the second quadrant of output feature map 206 ⁴, elements o⁴ ₅, o⁴ ₆, o⁴ ₇, and o⁴ ₈ are calculated in the same manner using filter 202 ⁴.

For the third quadrant of output feature map 206 ¹, element o¹ ₉ is the dot product of filter 202 ¹ and the first block of the third quadrant of input feature map 204, element o¹ ₁₀ is the dot product of filter 202 ¹ and the second block of the third quadrant of input feature map 204, element o¹ ₁₁ is the dot product of filter 202 ¹ and the third block of the third quadrant of input feature map 204, and element o¹ ₁₂ is the dot product of filter 202 ¹ and the fourth block of the third quadrant of input feature map 204. For the third quadrant of output feature map 206 ², elements o² ₉, o² ₁₀, o² ₁₁, and o² ₁₂ are calculated in the same manner using filter 202 ². For the third quadrant of output feature map 206 ³, elements o³ ₉, o³ ₁₀, o³ ₁₁, and o³ ₁₂ are calculated in the same manner using filter 202 ³. For the third quadrant of output feature map 206 ⁴, elements o⁴ ₉, o⁴ ₁₀, o⁴ ₁₁, and o⁴ ₁₂ are calculated in the same manner using filter 202 ⁴.

For the fourth quadrant of output feature map 206 ¹, element o¹ ₁₃ is the dot product of filter 202 ¹ and the first block of the fourth quadrant of input feature map 204, element o¹ ₁₄ is the dot product of filter 202 ¹ and the second block of the fourth quadrant of input feature map 204, element o¹ ₁₅ is the dot product of filter 202 ¹ and the third block of the fourth quadrant of input feature map 204, and element o¹ ₁₆ is the dot product of filter 202 ¹ and the fourth block of the fourth quadrant of input feature map 204. For the fourth quadrant of output feature map 206 ², elements o² ₁₃, o² ₁₄, o² ₁₅, and o² ₁₆ are calculated in the same manner using filter 202 ². For the fourth quadrant of output feature map 206 ³, elements o³ ₁₃, o³ ₁₄, o³ ₁₅, and o³ ₁₆ are calculated in the same manner using filter 202 ³. For the fourth quadrant of output feature map 206 ⁴, elements o⁴ ₁₃, o⁴ ₁₄, o⁴ ₁₅, and o⁴ ₁₆ are calculated in the same manner using filter 202 ⁴.
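The element-by-element description above follows a single pattern, which may be restated compactly. In the expression below, r and c are zero-based block row and block column indices introduced here for illustration only (they do not appear in the figures), and m identifies the filter and its output feature map:

```latex
o^{m}_{4r+c+1} = w^{m}_{1}\,a_{5r+c+1} + w^{m}_{2}\,a_{5r+c+2}
               + w^{m}_{3}\,a_{5r+c+6} + w^{m}_{4}\,a_{5r+c+7},
\qquad r, c \in \{0, 1, 2, 3\},\quad m \in \{1, 2, 3, 4\}
```

For example, r = 0 and c = 0 gives o^m_1 = w^m_1·a₁ + w^m_2·a₂ + w^m_3·a₆ + w^m_4·a₇, and r = 1 and c = 0 gives o^m_5 = w^m_1·a₆ + w^m_2·a₇ + w^m_3·a₁₁ + w^m_4·a₁₂. Under this indexing, the q-th quadrant corresponds to r = q − 1.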

An activation function and a bias may be applied to each element of output feature maps 206, which are then provided as the input feature maps 204 to the next layer. An activation function and bias may be applied after each element of output feature maps 206 is calculated, after all of the elements of output feature maps 206 are calculated, or by a subsequent locally-connected layer, such as an ReLU layer.

Similar to the fully-connected layer calculations for ANNs, convolution operations may be recast as general matrix multiplication (GEMM) operations, and implemented in an ANN hardware accelerator using one or more arrays of MAC units, an NVM accelerator with one or more analog NVM crossbar arrays, etc. The filter weights and activations (i.e., input feature maps or IFMs) for the convolution operation are converted into an expanded format (e.g., IM2COL format), and then processed as GEMM operations by the ANN hardware accelerator to generate output feature maps (OFMs).

FIG. 2C depicts a converted convolutional operation 210 within convolutional layer 30-1 of CNN 15, in accordance with an embodiment of the present disclosure.

In this embodiment, convolution layer operation 200 has been converted into a simple matrix multiplication operation by converting filters 202 into converted weight matrix 212, converting input feature map 204 into converted input data matrix 214, and converting output feature maps 206 into converted output data matrix 216. Converted weight matrix 212 (4×4) and converted input data matrix 214 (4×16) are multiplied to generate converted output data matrix 216 (4×16), which includes output data sets 216 ¹, 216 ², 216 ³ and 216 ⁴ (each 1×16). Output data sets 216 ¹, 216 ², 216 ³ and 216 ⁴ are then reformed into output feature maps 206 ¹, 206 ², 206 ³ and 206 ⁴ (each 4×4), respectively.

Converted weight matrix 212 includes converted weight sets 212 ¹, 212 ², 212 ³ and 212 ⁴. Converted weight set 212 ¹ includes the elements of filter 202 ¹, i.e., w¹ ₁, w¹ ₂, w¹ ₃ and w¹ ₄ arranged in a single (first) row. Converted weight set 212 ² includes the elements of filter 202 ², i.e., w² ₁, w² ₂, w² ₃ and w² ₄ flattened into a single (second) row. Converted weight set 212 ³ includes the elements of filter 202 ³, i.e., w³ ₁, w³ ₂, w³ ₃ and w³ ₄ flattened into a single (third) row. Converted weight set 212 ⁴ includes the elements of filter 202 ⁴, i.e., w⁴ ₁, w⁴ ₂, w⁴ ₃ and w⁴ ₄ flattened into a single (fourth) row.

Converted input data matrix 214 includes the elements of input feature map 204 recast as a larger matrix that implements the convolution operation as a simple matrix multiplication operation. Due to the mechanics of the convolution operation (discussed above), certain elements of input feature map 204 appear more than once in converted input data matrix 214; in this example, each corner element appears once, each remaining edge element appears twice, and each interior element appears four times. Generally, each row of converted weight matrix 212 is a filter, each column of converted input data matrix 214 is a block of input data upon which each filter operates, and each dot product calculation, i.e., the multiplication of each row by each column, generates a different element of converted output data matrix 216.

For ease of discussion, converted input data matrix 214 may be divided into four portions or quadrants, i.e., a_(q1), a_(q2), a_(q3) and a_(q4), and converted output data matrix 216 may be divided into four portions or quadrants, i.e., o_(q1), o_(q2), o_(q3) and o_(q4).

The first quadrant a_(q1) of converted input data matrix 214 includes the four blocks of the first quadrant of input feature map 204, each block arranged as a column. Similarly, the second quadrant a_(q2) of converted input data matrix 214 includes the four blocks of the second quadrant of input feature map 204, each block arranged as a column. The third quadrant a_(q3) of converted input data matrix 214 includes the four blocks of the third quadrant of input feature map 204, each block arranged as a column. And, the fourth quadrant a_(q4) of converted input data matrix 214 includes the four blocks of the fourth quadrant of input feature map 204, each block arranged as a column.

More particularly, the first column of the first quadrant a_(q1) of converted input data matrix 214 includes elements a₁, a₂, a₆ and a₇, which are the same elements in the same sequence (i.e., row-major order) as the first block of the first quadrant of input feature map 204. The second column of the first quadrant a_(q1) of converted input data matrix 214 includes elements a₂, a₃, a₇ and a₈, which are the same elements in the same sequence (i.e., row-major order) as the second block of the first quadrant of input feature map 204. The third column of the first quadrant a_(q1) of converted input data matrix 214 includes elements a₃, a₄, a₈ and a₉, which are the same elements in the same sequence (i.e., row-major order) as the third block of the first quadrant of input feature map 204. The fourth column of the first quadrant a_(q1) of converted input data matrix 214 includes elements a₄, a₅, a₉ and a₁₀, which are the same elements in the same sequence (i.e., row-major order) as the fourth block of the first quadrant of input feature map 204. And so on for quadrants a_(q2), a_(q3) and a_(q4) of converted input data matrix 214.
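By way of illustration only, the arrangement of blocks into columns described above may be sketched as follows. The sketch assumes a 5×5 input feature map and a 2×2 filter with a stride of one and no padding, which is inferred from the element indices a₁ through a₂₅; the function name im2col and the use of integer n to stand for element a_n are hypothetical conveniences and are not part of the described embodiment.

```python
def im2col(ifm, k):
    """Flatten each k x k block (stride 1, no padding) of a square map into one column."""
    out = len(ifm) - k + 1  # number of block positions per dimension
    blocks = [
        [ifm[r + dr][c + dc] for dr in range(k) for dc in range(k)]  # row-major order
        for r in range(out)
        for c in range(out)
    ]
    # Transpose the list of blocks so that every block becomes a column.
    return [list(row) for row in zip(*blocks)]


# Assumed 5x5 input feature map; entry n stands for element a_n.
ifm = [[5 * r + c + 1 for c in range(5)] for r in range(5)]
converted = im2col(ifm, 2)  # 4 rows x 16 columns

# The first four columns form the first quadrant of the converted matrix.
first_quadrant = [row[:4] for row in converted]
for row in first_quadrant:
    print(row)
```

Under these assumptions, the first column printed is 1, 2, 6, 7 (read top to bottom), corresponding to elements a₁, a₂, a₆ and a₇.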

The first row of the first quadrant o_(q1) of converted output data matrix 216 includes elements o¹ ₁, o¹ ₂, o¹ ₃ and o¹ ₄, which are the same elements in the same sequence as the first row of the first quadrant of output feature map 206 ¹. The second row of the first quadrant o_(q1) of converted output data matrix 216 includes elements o² ₁, o² ₂, o² ₃ and o² ₄, which are the same elements in the same sequence as the first row of the first quadrant of output feature map 206 ². The third row of the first quadrant o_(q1) of converted output data matrix 216 includes elements o³ ₁, o³ ₂, o³ ₃ and o³ ₄, which are the same elements in the same sequence as the first row of the first quadrant of output feature map 206 ³. The fourth row of the first quadrant o_(q1) of converted output data matrix 216 includes elements o⁴ ₁, o⁴ ₂, o⁴ ₃ and o⁴ ₄, which are the same elements in the same sequence as the first row of the first quadrant of output feature map 206 ⁴. And so on for quadrants o_(q2), o_(q3) and o_(q4) of converted output data matrix 216.

To generate the first quadrant o_(q1) of converted output data matrix 216, converted weight matrix 212 and the first quadrant a_(q1) of converted input data matrix 214 are multiplied together. For the first row of the first quadrant o_(q1), element o¹ ₁ is the dot product of the first row of converted weight matrix 212 and the first column of converted input data matrix 214, i.e., o¹ ₁ is equal to w¹ ₁·a₁+w¹ ₂·a₂+w¹ ₃·a₆+w¹ ₄·a₇. Element o¹ ₂ is the dot product of the first row of converted weight matrix 212 and the second column of converted input data matrix 214, i.e., o¹ ₂ is equal to w¹ ₁·a₂+w¹ ₂·a₃+w¹ ₃·a₇+w¹ ₄·a₈. Element o¹ ₃ is the dot product of the first row of converted weight matrix 212 and the third column of converted input data matrix 214, i.e., o¹ ₃ is equal to w¹ ₁·a₃+w¹ ₂·a₄+w¹ ₃·a₈+w¹ ₄·a₉. Element o¹ ₄ is the dot product of the first row of converted weight matrix 212 and the fourth column of converted input data matrix 214, i.e., o¹ ₄ is equal to w¹ ₁·a₄+w¹ ₂·a₅+w¹ ₃·a₉+w¹ ₄·a₁₀. The elements of the second, third and fourth rows of the first quadrant o_(q1) of converted output data matrix 216, i.e., elements o² ₁, o² ₂, o² ₃, o² ₄, o³ ₁, o³ ₂, o³ ₃, o³ ₄, o⁴ ₁, o⁴ ₂, o⁴ ₃ and o⁴ ₄, are calculated in the same manner using the second, third and fourth rows of converted weight matrix 212, respectively.

To generate the second quadrant o_(q2) of converted output data matrix 216, converted weight matrix 212 and the second quadrant a_(q2) of converted input data matrix 214 are multiplied together. For the first row of the second quadrant o_(q2), element o¹ ₅ is the dot product of the first row of converted weight matrix 212 and the fifth column of converted input data matrix 214, i.e., o¹ ₅ is equal to w¹ ₁·a₆+w¹ ₂·a₇+w¹ ₃·a₁₁+w¹ ₄·a₁₂. Element o¹ ₆ is the dot product of the first row of converted weight matrix 212 and the sixth column of converted input data matrix 214, i.e., o¹ ₆ is equal to w¹ ₁·a₇+w¹ ₂·a₈+w¹ ₃·a₁₂+w¹ ₄·a₁₃. Element o¹ ₇ is the dot product of the first row of converted weight matrix 212 and the seventh column of converted input data matrix 214, i.e., o¹ ₇ is equal to w¹ ₁·a₈+w¹ ₂·a₉+w¹ ₃·a₁₃+w¹ ₄·a₁₄. Element o¹ ₈ is the dot product of the first row of converted weight matrix 212 and the eighth column of converted input data matrix 214, i.e., o¹ ₈ is equal to w¹ ₁·a₉+w¹ ₂·a₁₀+w¹ ₃·a₁₄+w¹ ₄·a₁₅. The elements of the second, third and fourth rows of the second quadrant o_(q2) of converted output data matrix 216, i.e., elements o² ₅, o² ₆, o² ₇, o² ₈, o³ ₅, o³ ₆, o³ ₇, o³ ₈, o⁴ ₅, o⁴ ₆, o⁴ ₇ and o⁴ ₈, are calculated in the same manner using the second, third and fourth rows of converted weight matrix 212, respectively.

To generate the third quadrant o_(q3) of converted output data matrix 216, converted weight matrix 212 and the third quadrant a_(q3) of converted input data matrix 214 are multiplied together. For the first row of the third quadrant o_(q3), element o¹ ₉ is the dot product of the first row of converted weight matrix 212 and the ninth column of converted input data matrix 214, i.e., o¹ ₉ is equal to w¹ ₁·a₁₁+w¹ ₂·a₁₂+w¹ ₃·a₁₆+w¹ ₄·a₁₇. Element o¹ ₁₀ is the dot product of the first row of converted weight matrix 212 and the 10^(th) column of converted input data matrix 214, i.e., o¹ ₁₀ is equal to w¹ ₁·a₁₂+w¹ ₂·a₁₃+w¹ ₃·a₁₇+w¹ ₄·a₁₈. Element o¹ ₁₁ is the dot product of the first row of converted weight matrix 212 and the 11^(th) column of converted input data matrix 214, i.e., o¹ ₁₁ is equal to w¹ ₁·a₁₃+w¹ ₂·a₁₄+w¹ ₃·a₁₈+w¹ ₄·a₁₉. Element o¹ ₁₂ is the dot product of the first row of converted weight matrix 212 and the 12^(th) column of converted input data matrix 214, i.e., o¹ ₁₂ is equal to w¹ ₁·a₁₄+w¹ ₂·a₁₅+w¹ ₃·a₁₉+w¹ ₄·a₂₀. The elements of the second, third and fourth rows of the third quadrant o_(q3) of converted output data matrix 216, i.e., elements o² ₉, o² ₁₀, o² ₁₁, o² ₁₂, o³ ₉, o³ ₁₀, o³ ₁₁, o³ ₁₂, o⁴ ₉, o⁴ ₁₀, o⁴ ₁₁ and o⁴ ₁₂, are calculated in the same manner using the second, third and fourth rows of converted weight matrix 212, respectively.

To generate the fourth quadrant o_(q4) of converted output data matrix 216, converted weight matrix 212 and the fourth quadrant a_(q4) of converted input data matrix 214 are multiplied together. For the first row of the fourth quadrant o_(q4), element o¹ ₁₃ is the dot product of the first row of converted weight matrix 212 and the 13^(th) column of converted input data matrix 214, i.e., o¹ ₁₃ is equal to w¹ ₁·a₁₆+w¹ ₂·a₁₇+w¹ ₃·a₂₁+w¹ ₄·a₂₂. Element o¹ ₁₄ is the dot product of the first row of converted weight matrix 212 and the 14^(th) column of converted input data matrix 214, i.e., o¹ ₁₄ is equal to w¹ ₁·a₁₇+w¹ ₂·a₁₈+w¹ ₃·a₂₂+w¹ ₄·a₂₃. Element o¹ ₁₅ is the dot product of the first row of converted weight matrix 212 and the 15^(th) column of converted input data matrix 214, i.e., o¹ ₁₅ is equal to w¹ ₁·a₁₈+w¹ ₂·a₁₉+w¹ ₃·a₂₃+w¹ ₄·a₂₄. Element o¹ ₁₆ is the dot product of the first row of converted weight matrix 212 and the 16^(th) column of converted input data matrix 214, i.e., o¹ ₁₆ is equal to w¹ ₁·a₁₉+w¹ ₂·a₂₀+w¹ ₃·a₂₄+w¹ ₄·a₂₅. The elements of the second, third and fourth rows of the fourth quadrant o_(q4) of converted output data matrix 216, i.e., elements o² ₁₃, o² ₁₄, o² ₁₅, o² ₁₆, o³ ₁₃, o³ ₁₄, o³ ₁₅, o³ ₁₆, o⁴ ₁₃, o⁴ ₁₄, o⁴ ₁₅ and o⁴ ₁₆, are calculated in the same manner using the second, third and fourth rows of converted weight matrix 212, respectively.
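As a non-limiting illustration, the equivalence between the converted matrix multiplication and a direct sliding-window convolution may be checked numerically. The sketch below assumes a 5×5 input feature map and four 2×2 filters with a stride of one and no padding; the weight values are hypothetical and were chosen only so that the results are easy to verify by hand, and the helper names im2col, gemm and direct_conv are likewise hypothetical.

```python
def im2col(ifm, k):
    """Arrange each k x k block (stride 1, no padding) as one column."""
    out = len(ifm) - k + 1
    blocks = [[ifm[r + dr][c + dc] for dr in range(k) for dc in range(k)]
              for r in range(out) for c in range(out)]
    return [list(row) for row in zip(*blocks)]


def gemm(w, a):
    """Multiply matrix w (rows = flattened filters) by matrix a (columns = blocks)."""
    return [[sum(x * y for x, y in zip(w_row, a_col)) for a_col in zip(*a)]
            for w_row in w]


def direct_conv(ifm, flat_filter, k):
    """Reference sliding-window computation for one flattened k x k filter."""
    out = len(ifm) - k + 1
    return [[sum(flat_filter[dr * k + dc] * ifm[r + dr][c + dc]
                 for dr in range(k) for dc in range(k))
             for c in range(out)] for r in range(out)]


ifm = [[5 * r + c + 1 for c in range(5)] for r in range(5)]  # n stands for a_n
weights = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 1, 1, 1]]  # hypothetical

output = gemm(weights, im2col(ifm, 2))  # 4 x 16 converted output matrix
# Reform each 1 x 16 output data set into a 4 x 4 output feature map.
feature_maps = [[row[i:i + 4] for i in range(0, 16, 4)] for row in output]

for f in range(4):
    assert feature_maps[f] == direct_conv(ifm, weights[f], 2)
```

With these hypothetical weights, the fourth filter sums its block, so the element corresponding to o⁴ ₁ evaluates to 1+2+6+7=16 and the element corresponding to o⁴ ₁₆ evaluates to 19+20+24+25=88.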

FIG. 3A depicts a data flow diagram 300 for digital MAC array 318.

GEMM operations may be implemented in a dedicated ANN hardware accelerator that includes an array 318 of digital MAC units 310 that perform digital MAC operations. In this embodiment, digital MAC array 318 is a systolic, output stationary array that implements converted convolution operation 210 using a 4×4 array of digital MAC units 310.m ₁, . . . , 310.m ₁₆. The orientation of transposed weight matrix 312, transposed input data matrix 314, and transposed output data matrix 316 relative to digital MAC array 318 simplifies illustration; other orientations are also contemplated.

As discussed above, each digital MAC unit 310 calculates a dot product, between a row of converted weight matrix 212 and a column of converted input data matrix 214, to generate an element of converted output data matrix 216. Generally, a digital MAC unit includes, inter alia, a multiplier, an adder and a storage register. Each digital MAC unit is reset by clearing or zeroing its storage register prior to, or at the start of, a new dot product calculation.

Generally, elements from converted weight matrix 212 are read from local memory, enter digital MAC array 318 at the first row of digital MAC units 310, i.e., 310.m ₁, 310.m ₂, 310.m ₃ and 310.m ₄, and propagate one digital MAC unit down at the beginning of each processing cycle. Similarly, elements from converted input data matrix 214 are read from local memory, enter digital MAC array 318 at the first column of digital MAC units 310, i.e., 310.m ₁, 310.m ₅, 310.m ₉ and 310.m ₁₃, and propagate one digital MAC unit to the right at the beginning of each processing cycle.

The dot product calculations performed by digital MAC unit 310.m ₁ for the first quadrant a_(q1) of converted input data matrix 214 are discussed in detail below, while the dot product calculations performed by the remaining digital MAC units 310 of digital MAC array 318 for the first quadrant a_(q1) of converted input data matrix 214 are summarized below.

Digital MAC unit 310.m ₁ calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the first column of converted input data matrix 214 to generate element o¹ ₁ of converted output data matrix 216. During the first processing cycle, digital MAC unit 310.m ₁ receives a₁ and w¹ ₁ from local memory, multiplies a₁ and w¹ ₁ to generate an intermediate product, adds the intermediate product to the value stored in the storage register (i.e., 0), and stores the accumulated result back in the storage register. During the second processing cycle, digital MAC unit 310.m ₁ transmits a₁ to digital MAC unit 310.m ₂ and w¹ ₁ to digital MAC unit 310.m ₅, receives a₂ and w¹ ₂ from local memory, multiplies a₂ and w¹ ₂ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, and stores the accumulated result back in the storage register.

During the third processing cycle, digital MAC unit 310.m ₁ transmits a₂ to digital MAC unit 310.m ₂ and w¹ ₂ to digital MAC unit 310.m ₅, receives a₆ and w¹ ₃ from local memory, multiplies a₆ and w¹ ₃ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, and stores the accumulated result back in the storage register. During the fourth processing cycle, digital MAC unit 310.m ₁ transmits a₆ to digital MAC unit 310.m ₂ and w¹ ₃ to digital MAC unit 310.m ₅, receives a₇ and w¹ ₄ from the local memory, multiplies a₇ and w¹ ₄ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, stores the accumulated result back in the storage register, and then outputs the value stored in the storage register as element o¹ ₁. During the fifth processing cycle, digital MAC unit 310.m ₁ transmits a₇ to digital MAC unit 310.m ₂ and w¹ ₄ to digital MAC unit 310.m ₅, and then waits for the next sequence of operations to begin.
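For purposes of illustration, the multiply, add and store behavior of a single digital MAC unit over four processing cycles may be modeled as shown below. The class name DigitalMacUnit, its method names and the numeric weight and activation values are hypothetical; the sketch models only the multiplier, adder and storage register and omits the forwarding of operands to neighboring units.

```python
class DigitalMacUnit:
    """Multiplier, adder and storage register of one digital MAC unit."""

    def __init__(self):
        self.register = 0

    def reset(self):
        # Clear the storage register before a new dot product calculation.
        self.register = 0

    def step(self, weight, activation):
        # One processing cycle: multiply, add to the stored value, store back.
        self.register += weight * activation
        return self.register


weights = [3, -1, 2, 5]      # hypothetical values standing in for four weights
activations = [1, 2, 6, 7]   # hypothetical values standing in for four activations

mac = DigitalMacUnit()
mac.reset()
partial_sums = [mac.step(w, a) for w, a in zip(weights, activations)]
print(partial_sums)  # value of the storage register after each cycle
```

After the fourth call to step, the storage register holds the complete dot product, which in this example is 3·1−1·2+2·6+5·7=48.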

The remainder of the first row of digital MAC array 318 includes digital MAC units 310.m ₂, 310.m ₃ and 310.m ₄.

After an initial delay of one processing cycle, digital MAC unit 310.m ₂ receives weights from the first delay register ff₁ and input data from digital MAC unit 310.m ₁, transmits weights to digital MAC unit 310.m ₆ and input data to digital MAC unit 310.m ₃, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the first column of converted input data matrix 214 to generate element o² ₁ of converted output data matrix 216. The initial delay of one processing cycle allows the delay pipeline (i.e., delay register ff₁) to be filled with weights transferred from memory, and the input data to become available from digital MAC unit 310.m ₁. At the end of the fifth processing cycle, digital MAC unit 310.m ₂ outputs element o² ₁.

After an initial delay of two processing cycles, digital MAC unit 310.m ₃ receives weights from the second delay register ff₂ and input data from digital MAC unit 310.m ₂, transmits weights to digital MAC unit 310.m ₇ and input data to digital MAC unit 310.m ₄, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the first column of converted input data matrix 214 to generate element o³ ₁ of converted output data matrix 216. The initial delay of two processing cycles allows the delay pipeline (i.e., delay registers ff₁ and ff₂) to be filled with weights transferred from memory, and the input data to become available from digital MAC unit 310.m ₂. At the end of the sixth processing cycle, digital MAC unit 310.m ₃ outputs element o³ ₁.

After an initial delay of three processing cycles, digital MAC unit 310.m ₄ receives weights from the third delay register ff₃ and input data from digital MAC unit 310.m ₃, transmits weights to digital MAC unit 310.m ₈, and calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the first column of converted input data matrix 214 to generate element o⁴ ₁ of converted output data matrix 216. The initial delay of three processing cycles allows the delay pipeline (i.e., delay registers ff₁, ff₂ and ff₃) to be filled with weights transferred from memory, and the input data to become available from digital MAC unit 310.m ₃. At the end of the seventh processing cycle, digital MAC unit 310.m ₄ outputs element o⁴ ₁.

The second row of digital MAC array 318 includes digital MAC units 310.m ₅, 310.m ₆, 310.m ₇ and 310.m ₈.

After an initial delay of one processing cycle, digital MAC unit 310.m ₅ receives weights from digital MAC unit 310.m ₁ and input data from a first delay register ff₁, transmits weights to digital MAC unit 310.m ₉ and input data to digital MAC unit 310.m ₆, and calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the second column of converted input data matrix 214 to generate element o¹ ₂ of converted output data matrix 216. The initial delay of one processing cycle allows the delay pipeline (i.e., delay register ff₁) to be filled with input data transferred from memory, and the weights to become available from digital MAC unit 310.m ₁. At the end of the fifth processing cycle, digital MAC unit 310.m ₅ outputs element o¹ ₂.

After an initial delay of two processing cycles, digital MAC unit 310.m ₆ receives weights from digital MAC unit 310.m ₂ and input data from digital MAC unit 310.m ₅, transmits weights to digital MAC unit 310.m ₁₀ and input data to digital MAC unit 310.m ₇, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the second column of converted input data matrix 214 to generate element o² ₂ of converted output data matrix 216. The initial delay of two processing cycles allows the weights to become available from digital MAC unit 310.m ₂, and the input data to become available from digital MAC unit 310.m ₅. At the end of the sixth processing cycle, digital MAC unit 310.m ₆ outputs element o² ₂.

After an initial delay of three processing cycles, digital MAC unit 310.m ₇ receives weights from digital MAC unit 310.m ₃ and input data from digital MAC unit 310.m ₆, transmits weights to digital MAC unit 310.m ₁₁ and input data to digital MAC unit 310.m ₈, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the second column of converted input data matrix 214 to generate element o³ ₂ of converted output data matrix 216. The initial delay of three processing cycles allows the weights to become available from digital MAC unit 310.m ₃, and the input data to become available from digital MAC unit 310.m ₆. At the end of the seventh processing cycle, digital MAC unit 310.m ₇ outputs element o³ ₂.

After an initial delay of four processing cycles, digital MAC unit 310.m ₈ receives weights from digital MAC unit 310.m ₄ and input data from digital MAC unit 310.m ₇, transmits weights to digital MAC unit 310.m ₁₂, and calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the second column of converted input data matrix 214 to generate element o⁴ ₂ of converted output data matrix 216. The initial delay of four processing cycles allows the weights to become available from digital MAC unit 310.m ₄, and the input data to become available from digital MAC unit 310.m ₇. At the end of the eighth processing cycle, digital MAC unit 310.m ₈ outputs element o⁴ ₂.

The third row of digital MAC array 318 includes digital MAC units 310.m ₉, 310.m ₁₀, 310.m ₁₁ and 310.m ₁₂.

After an initial delay of two processing cycles, digital MAC unit 310.m ₉ receives weights from digital MAC unit 310.m ₅ and input data from a second delay register ff₂, transmits weights to digital MAC unit 310.m ₁₃ and input data to digital MAC unit 310.m ₁₀, and calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the third column of converted input data matrix 214 to generate element o¹ ₃ of converted output data matrix 216. The initial delay of two processing cycles allows the delay pipeline (i.e., delay registers ff₁ and ff₂) to be filled with input data transferred from memory, and the weights to become available from digital MAC unit 310.m ₅. At the end of the sixth processing cycle, digital MAC unit 310.m ₉ outputs element o¹ ₃.

After an initial delay of three processing cycles, digital MAC unit 310.m ₁₀ receives weights from digital MAC unit 310.m ₆ and input data from digital MAC unit 310.m ₉, transmits weights to digital MAC unit 310.m ₁₄ and input data to digital MAC unit 310.m ₁₁, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the third column of converted input data matrix 214 to generate element o² ₃ of converted output data matrix 216. The initial delay of three processing cycles allows the weights to become available from digital MAC unit 310.m ₆, and the input data to become available from digital MAC unit 310.m ₉. At the end of the seventh processing cycle, digital MAC unit 310.m ₁₀ outputs element o² ₃.

After an initial delay of four processing cycles, digital MAC unit 310.m ₁₁ receives weights from digital MAC unit 310.m ₇ and input data from digital MAC unit 310.m ₁₀, transmits weights to digital MAC unit 310.m ₁₅ and input data to digital MAC unit 310.m ₁₂, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the third column of converted input data matrix 214 to generate element o³ ₃ of converted output data matrix 216. The initial delay of four processing cycles allows the weights to become available from digital MAC unit 310.m ₇, and the input data to become available from digital MAC unit 310.m ₁₀. At the end of the eighth processing cycle, digital MAC unit 310.m ₁₁ outputs element o³ ₃.

After an initial delay of five processing cycles, digital MAC unit 310.m ₁₂ receives weights from digital MAC unit 310.m ₈ and input data from digital MAC unit 310.m ₁₁, transmits weights to digital MAC unit 310.m ₁₆, and calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the third column of converted input data matrix 214 to generate element o⁴ ₃ of converted output data matrix 216. The initial delay of five processing cycles allows the weights to become available from digital MAC unit 310.m ₈, and the input data to become available from digital MAC unit 310.m ₁₁. At the end of the ninth processing cycle, digital MAC unit 310.m ₁₂ outputs element o⁴ ₃.

The fourth row of digital MAC array 318 includes digital MAC units 310.m ₁₃, 310.m ₁₄, 310.m ₁₅ and 310.m ₁₆.

After an initial delay of three processing cycles, digital MAC unit 310.m ₁₃ receives weights from digital MAC unit 310.m ₉ and input data from a third delay register ff₃, transmits input data to digital MAC unit 310.m ₁₄, and calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the fourth column of converted input data matrix 214 to generate element o¹ ₄ of converted output data matrix 216. The initial delay of three processing cycles allows the delay pipeline (i.e., delay registers ff₁, ff₂ and ff₃) to be filled with input data transferred from memory, and the weights to become available from digital MAC unit 310.m ₉. At the end of the seventh processing cycle, digital MAC unit 310.m ₁₃ outputs element o¹ ₄.

After an initial delay of four processing cycles, digital MAC unit 310.m ₁₄ receives weights from digital MAC unit 310.m ₁₀ and input data from digital MAC unit 310.m ₁₃, transmits input data to digital MAC unit 310.m ₁₅, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the fourth column of converted input data matrix 214 to generate element o² ₄ of converted output data matrix 216. The initial delay of four processing cycles allows the weights to become available from digital MAC unit 310.m ₁₀, and the input data to become available from digital MAC unit 310.m ₁₃. At the end of the eighth processing cycle, digital MAC unit 310.m ₁₄ outputs element o² ₄.

After an initial delay of five processing cycles, digital MAC unit 310.m ₁₅ receives weights from digital MAC unit 310.m ₁₁ and input data from digital MAC unit 310.m ₁₄, transmits input data to digital MAC unit 310.m ₁₆, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the fourth column of converted input data matrix 214 to generate element o³ ₄ of converted output data matrix 216. The initial delay of five processing cycles allows the weights to become available from digital MAC unit 310.m ₁₁, and the input data to become available from digital MAC unit 310.m ₁₄. At the end of the ninth processing cycle, digital MAC unit 310.m ₁₅ outputs element o³ ₄.

After an initial delay of six processing cycles, digital MAC unit 310.m ₁₆ receives weights from digital MAC unit 310.m ₁₂ and input data from digital MAC unit 310.m ₁₅, and calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the fourth column of converted input data matrix 214 to generate element o⁴ ₄ of converted output data matrix 216. The initial delay of six processing cycles allows the weights to become available from digital MAC unit 310.m ₁₂, and the input data to become available from digital MAC unit 310.m ₁₅. At the end of the tenth processing cycle, digital MAC unit 310.m ₁₆ outputs element o⁴ ₄.
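The timing described above for the sixteen digital MAC units may be reproduced with a simplified cycle-level model, offered here only as a sketch. The model assumes that weights enter at the top edge and move down one unit per cycle, that input data enter at the left edge and move right one unit per cycle, and that the delay registers are represented by a skew of one cycle per row or column rather than being modeled individually. The function name simulate, its parameters and the numeric values are hypothetical. In this model, the unit at zero-based array row r and column c accumulates the dot product of weight row c+1 and input column r+1.

```python
def simulate(weights, inputs, n=4):
    """Cycle-level model of an n x n output-stationary systolic array.

    weights[c][k]: k-th weight of the filter fed into array column c.
    inputs[k][r]:  k-th element of the input column fed into array row r.
    Returns (acc, done): the accumulated sums and the 1-based cycle in
    which each unit performs its last multiply-accumulate.
    """
    depth = len(inputs)
    acc = [[0] * n for _ in range(n)]
    count = [[0] * n for _ in range(n)]
    done = [[0] * n for _ in range(n)]
    w_reg = [[None] * n for _ in range(n)]  # weight held by each unit
    a_reg = [[None] * n for _ in range(n)]  # input held by each unit

    for cycle in range(1, depth + 2 * (n - 1) + 1):
        new_w = [[None] * n for _ in range(n)]
        new_a = [[None] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if r == 0:  # top edge: feed skewed by c cycles
                    k = cycle - 1 - c
                    new_w[r][c] = weights[c][k] if 0 <= k < depth else None
                else:       # otherwise take the weight from the unit above
                    new_w[r][c] = w_reg[r - 1][c]
                if c == 0:  # left edge: feed skewed by r cycles
                    k = cycle - 1 - r
                    new_a[r][c] = inputs[k][r] if 0 <= k < depth else None
                else:       # otherwise take the input from the unit to the left
                    new_a[r][c] = a_reg[r][c - 1]
        w_reg, a_reg = new_w, new_a
        for r in range(n):
            for c in range(n):
                if w_reg[r][c] is not None and a_reg[r][c] is not None:
                    acc[r][c] += w_reg[r][c] * a_reg[r][c]
                    count[r][c] += 1
                    if count[r][c] == depth:
                        done[r][c] = cycle
    return acc, done


weights = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 1, 1, 1]]  # hypothetical
# First quadrant of the converted input data matrix; n stands for a_n.
inputs = [[1, 2, 3, 4], [2, 3, 4, 5], [6, 7, 8, 9], [7, 8, 9, 10]]

acc, done = simulate(weights, inputs)
for row in done:
    print(row)
```

Under these assumptions, the completion cycles range from the fourth processing cycle for the upper-left unit to the tenth processing cycle for the lower-right unit, with each step to the right or downward adding one cycle.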

After the first quadrant a_(q1) of converted input data matrix 214 has been processed, the next sequence of operations may begin in order to process the second quadrant a_(q2) of converted input data matrix 214. After the second quadrant a_(q2) of converted input data matrix 214 has been processed, the next sequence of operations may begin in order to process the third quadrant a_(q3) of converted input data matrix 214. And, after the third quadrant a_(q3) of converted input data matrix 214 has been processed, the final sequence of operations may begin in order to process the fourth quadrant a_(q4) of converted input data matrix 214. Converted weight matrix 212 is accessed for each sequence of operations.

In one embodiment, digital MAC array 318 may wait until the final element o⁴ ₄ of converted output data matrix 216 has been calculated at the end of the 10^(th) processing cycle before beginning the next sequence of operations for the next quadrant of converted input data matrix 214. In another embodiment, digital MAC array 318 may begin the next sequence of operations for the next quadrant of converted input data matrix 214 as soon as the first element o¹ ₁ of converted output data matrix 216 has been calculated at the end of the 4^(th) processing cycle. In this embodiment, digital MAC array 318 does not wait or suspend operations; instead, digital MAC array 318 continuously performs dot product calculations.
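The difference between the two embodiments may be illustrated with a short calculation of the number of processing cycles needed to process all four quadrants. The sketch assumes that each subsequent sequence of operations begins in the processing cycle immediately following the 10^(th) processing cycle (first embodiment) or the 4^(th) processing cycle (second embodiment) of the preceding sequence; the constant and function names are hypothetical, and the resulting totals are derived from that assumption rather than stated in this description.

```python
FIRST_RESULT = 4   # cycle in which the first element of a quadrant is complete
LAST_RESULT = 10   # cycle in which the last element of a quadrant is complete
QUADRANTS = 4


def total_cycles(overlapped):
    """Cycles needed to finish all quadrants under the stated assumption."""
    # Number of cycles between the starts of consecutive sequences.
    interval = FIRST_RESULT if overlapped else LAST_RESULT
    return (QUADRANTS - 1) * interval + LAST_RESULT


print(total_cycles(overlapped=False))  # 40
print(total_cycles(overlapped=True))   # 22
```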

Each column of converted weight matrix 212 is read at the beginning of a processing cycle. The first column of converted weight matrix 212, i.e., weights w¹ ₁, w² ₁, w³ ₁ and w⁴ ₁, is read at the beginning of the first processing cycle; w¹ ₁ is provided to digital MAC unit 310.m ₁, w² ₁ is provided to the first delay register ff₁ for digital MAC unit 310.m ₂, w³ ₁ is provided to the first delay register ff₁ for digital MAC unit 310.m ₃, and w⁴ ₁ is provided to the first delay register ff₁ for digital MAC unit 310.m ₄. Similarly, the second column of converted weight matrix 212, i.e., weights w¹ ₂, w² ₂, w³ ₂ and w⁴ ₂, is read at the beginning of the second processing cycle; w¹ ₂ is provided to digital MAC unit 310.m ₁, w² ₂ is provided to the first delay register ff₁ for digital MAC unit 310.m ₂, w³ ₂ is provided to the first delay register ff₁ for digital MAC unit 310.m ₃, and w⁴ ₂ is provided to the first delay register ff₁ for digital MAC unit 310.m ₄.

The third column of converted weight matrix 212, i.e., weights w¹ ₃, w² ₃, w³ ₃ and w⁴ ₃, is read at the beginning of the third processing cycle; w¹ ₃ is provided to digital MAC unit 310.m ₁, w² ₃ is provided to the first delay register ff₁ for digital MAC unit 310.m ₂, w³ ₃ is provided to the first delay register ff₁ for digital MAC unit 310.m ₃, and w⁴ ₃ is provided to the first delay register ff₁ for digital MAC unit 310.m ₄. And, the fourth column of converted weight matrix 212, i.e., weights w¹ ₄, w² ₄, w³ ₄ and w⁴ ₄, is read at the beginning of the fourth processing cycle; w¹ ₄ is provided to digital MAC unit 310.m ₁, w² ₄ is provided to the first delay register ff₁ for digital MAC unit 310.m ₂, w³ ₄ is provided to the first delay register ff₁ for digital MAC unit 310.m ₃, and w⁴ ₄ is provided to the first delay register ff₁ for digital MAC unit 310.m ₄.

Similarly, each row of a particular quadrant of converted input data matrix 214 is read at the beginning of a processing cycle. For example, the first row of the first quadrant a_(q1) of converted input data matrix 214, i.e., elements a₁, a₂, a₃ and a₄, is read at the beginning of the first processing cycle; a₁ is provided to digital MAC unit 310.m ₁, a₂ is provided to the first delay register ff₁ for digital MAC unit 310.m ₅, a₃ is provided to the first delay register ff₁ for digital MAC unit 310.m ₉, and a₄ is provided to the first delay register ff₁ for digital MAC unit 310.m ₁₃. Similarly, the second row of the first quadrant a_(q1) of converted input data matrix 214, i.e., elements a₂, a₃, a₄ and a₅, is read at the beginning of the second processing cycle; a₂ is provided to digital MAC unit 310.m ₁, a₃ is provided to the first delay register ff₁ for digital MAC unit 310.m ₅, a₄ is provided to the first delay register ff₁ for digital MAC unit 310.m ₉, and a₅ is provided to the first delay register ff₁ for digital MAC unit 310.m ₁₃.

The third row of the first quadrant a_(q1) of converted input data matrix 214, i.e., elements a₆, a₇, a₈ and a₉, is read at the beginning of the third processing cycle; a₆ is provided to digital MAC unit 310.m ₁, a₇ is provided to the first delay register ff₁ for digital MAC unit 310.m ₅, a₈ is provided to the first delay register ff₁ for digital MAC unit 310.m ₉, and a₉ is provided to the first delay register ff₁ for digital MAC unit 310.m ₁₃. And, the fourth row of the first quadrant a_(q1) of converted input data matrix 214, i.e., elements a₇, a₈, a₉ and a₁₀, is read at the beginning of the fourth processing cycle; a₇ is provided to digital MAC unit 310.m ₁, a₈ is provided to the first delay register ff₁ for digital MAC unit 310.m ₅, a₉ is provided to the first delay register ff₁ for digital MAC unit 310.m ₉, and a₁₀ is provided to the first delay register ff₁ for digital MAC unit 310.m ₁₃.

FIG. 3B depicts a data flow diagram 320 for analog NVM crossbar array 328.

GEMM operations may be implemented in a dedicated ANN hardware accelerator that includes digital processing circuitry and one or more analog NVM crossbar arrays 328 that perform analog MAC operations. The digital processing circuitry includes digital-to-analog converters (DACs) 330 and analog-to-digital converters (ADCs) 340. The analog NVM crossbar array 328 includes analog NVM cells 350, each of which includes one or more programmable analog NVM elements, such as, for example, phase change memory (PCM), resistive random access memory (RRAM), magnetic RAM (MRAM), correlated electron RAM (CeRAM), etc. This approach relies on a combination of Ohm's law and Kirchhoff's current law to implement analog MAC operations in parallel.

According to Ohm's law, the application of a voltage across an analog NVM cell generates a current that is proportional to the voltage across the analog NVM cell divided by the programmed resistance of the analog NVM cell. Since conductance (in siemens) is the reciprocal of resistance (in ohms), the application of a voltage across an analog NVM cell generates a current that is proportional to the product of the conductance of the analog NVM cell and the voltage across the analog NVM cell. According to Kirchhoff's current law, currents from analog NVM cells in the same column of an analog NVM crossbar array combine to generate an accumulated current (i.e., the sum of the products generated by the analog NVM cells in that column, i_(out)). Thus, an analog NVM crossbar array implements analog MAC operations in parallel through the combination of Ohm's law and Kirchhoff's current law. The analog NVM element(s) within each analog NVM cell are programmed to a discrete conductance level that represents a weight value.
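The relationships stated above may be written compactly as follows. The symbols are introduced here for illustration only: v_r denotes the voltage applied to row r, R_{r,c} and g_{r,c} denote the programmed resistance and conductance of the analog NVM cell at row r and column c, i_{r,c} denotes the current through that cell, and N denotes the number of rows. An ideal array without wire resistance or other parasitic effects is assumed.

```latex
\begin{align}
  i_{r,c} &= \frac{v_r}{R_{r,c}} = g_{r,c}\, v_r
    && \text{(Ohm's law, with } g_{r,c} = 1/R_{r,c}\text{)} \\
  i_{\mathrm{out},c} &= \sum_{r=1}^{N} i_{r,c} = \sum_{r=1}^{N} g_{r,c}\, v_r
    && \text{(Kirchhoff's current law along column } c\text{)}
\end{align}
```

The second expression has the form of a dot product between the vector of row voltages and the vector of conductances in column c, which is the multiply-and-accumulate operation performed by that column.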

In data flow diagram 320, converted convolution operation 210 is implemented by DACs 330, ADCs 340 and analog NVM crossbar array 328. Analog NVM crossbar array 328 includes four row signal lines 332, i.e., row signal lines 332 ¹, 332 ², 332 ³ and 332 ⁴, four column signal lines 342, i.e., column signal lines 342 ¹, 342 ², 342 ³ and 342 ⁴, and sixteen analog NVM cells 350, one disposed at each intersection of row signal lines 332 and column signal lines 342. DACs 330 are coupled to row signal lines 332, and ADCs 340 are coupled to column signal lines 342. Each column of analog NVM cells 350 calculates a dot product, between a row of converted weight matrix 212 and a column of converted input data matrix 214, to generate an element of converted output data matrix 216.

Each analog NVM cell 350 includes one or more analog NVM resistive switching elements which have a low-resistance state (LRS), e.g., R_(on), and a high-resistance state (HRS), e.g., R_(off). Due to the wide separation between the LRS and the HRS, each analog NVM cell may be programmed to encode a discrete, linearly-separated conductance value. As depicted in data flow diagram 320, weights w¹ ₁, w¹ ₂, w¹ ₃, w¹ ₄, w² ₁, w² ₂, w² ₃, w² ₄, w³ ₁, w³ ₂, w³ ₃, w³ ₄, w⁴ ₁, w⁴ ₂, w⁴ ₃, and w⁴ ₄ (i.e., weight matrix 322) have been converted to conductances g¹ ₁, g¹ ₂, g¹ ₃, g¹ ₄, g² ₁, g² ₂, g² ₃, g² ₄, g³ ₁, g³ ₂, g³ ₃, g³ ₄, g⁴ ₁, g⁴ ₂, g⁴ ₃, and g⁴ ₄ (i.e., conductance matrix 323), and analog NVM cells 350 have been programmed with these conductance values.

Column data from converted input data matrix 214 are sequentially provided as input data matrix 324 to DACs 330, which output respective analog voltages v₁, v₂, v₃ and v₄ along row signal lines 332 across analog NVM crossbar array 328. Column signal lines 342 convey four corresponding bitline (BL) signals, i.e., BL₁, BL₂, BL₃, and BL₄, whose currents (i.e., i₁, i₂, i₃, and i₄) are proportional to the accumulated dot-products of input data matrix 324 (i.e., v₁, v₂, v₃ and v₄) and conductances g¹ ₁, g¹ ₂, g¹ ₃, g¹ ₄, g² ₁, g² ₂, g² ₃, g² ₄, g³ ₁, g³ ₂, g³ ₃, g³ ₄, g⁴ ₁, g⁴ ₂, g⁴ ₃, and g⁴ ₄ along column signal lines 342. The BL signals are then digitized using ADCs 340 to generate output data 326, i.e., the columns of converted output data matrix 216.

Converted output data matrix 216 has four rows, each of which is reformed into a separate output feature map 206 ¹, 206 ², 206 ³ and 206 ⁴. In order to calculate the first elements of each row of the converted output data matrix, i.e., element o¹ ₁ of output feature map 206 ¹, element o² ₁ of output feature map 206 ², element o³ ₁ of output feature map 206 ³ and element o⁴ ₁ of output feature map 206 ⁴, input data from the first column of converted input data matrix 214, i.e., activations a₁, a₂, a₆ and a₇, are input to DACs 330. DACs 330 then output respective analog voltages across analog NVM crossbar array 328 along row signal lines 332. Column signal lines 342 convey the BL signals to ADCs 340, which digitize the BL signals to obtain the first elements of the rows of the converted output data matrix. The remaining elements of the converted output data matrix are calculated in a similar manner.
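By way of a non-limiting sketch, the sequential application of input columns to the crossbar may be modeled in Python as shown below. The function name crossbar_outputs, the nested-list layout and the numeric values are assumptions made for this example only; DAC and ADC quantization is not modeled.

```python
def crossbar_outputs(conductance_matrix, input_columns):
    """Model the bitline values produced for each applied input column.

    conductance_matrix[j][k] stands for the conductance g^(j+1)_(k+1), i.e.,
    the cell on row signal line k+1 and column signal line j+1.
    input_columns holds the columns of the converted input data matrix, in
    the order in which they are applied to the row signal lines.
    """
    results = []
    for voltages in input_columns:  # one input column is applied at a time
        bitlines = [
            sum(g * v for g, v in zip(column_cells, voltages))
            for column_cells in conductance_matrix
        ]
        results.append(bitlines)  # one value per column signal line
    return results


# Hypothetical conductances and two input columns (integer stand-ins).
g = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1], [2, 0, 0, 2]]
columns = [[1, 2, 6, 7], [2, 3, 7, 8]]
outputs = crossbar_outputs(g, columns)
```

Each inner list of the result corresponds to one column of the converted output data matrix, so the first inner list holds the first element of each of the four output rows.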

An activation function and a bias may be applied to each element of the converted output data matrix to generate the elements of output feature maps 206, which are then provided as input feature maps 204 to the next layer. The activation function and bias may be applied after each element of the converted output data matrix is calculated, after all of the elements of the converted output data matrix are calculated, or by a subsequent locally-connected layer, such as an ReLU layer.

Embodiments of the present disclosure advantageously improve the distribution of operands in ANN accelerators by eliminating the need for flip-flops, ADCs and DACs. More particularly, digital operands are encoded as analog signals using digitally-controlled oscillators (DCOs) and digital-to-time converters (DTCs), and then distributed for processing by an array of mixed-signal (i.e., analog and digital) MAC units. Advantageously, analog operand distribution reduces hardware cost because a single wire can encode a continuous value which could represent as much data as, for example, an 8-wire digital signal, while the digital value that is output by each element of the mixed-signal MAC array avoids expensive ADCs.

FIG. 4 depicts a data flow diagram 400 for a mixed-signal MAC array 418, in accordance with an embodiment of the present disclosure.

Advantageously, GEMM operations may be implemented in a dedicated ANN hardware accelerator that includes digital processing circuitry and one or more mixed-signal MAC arrays 418 that perform output stationary, mixed-signal MAC operations. In this embodiment, mixed-signal MAC array 418 is an output stationary array that implements converted convolution operation 210 using a 4×4 array of mixed-signal MAC units 440.M₁, . . . , 440.M₁₆. Other embodiments have different array dimensions, such as, for example, 2×2, 3×3, 5×5, etc., as well as asymmetric arrays, such as, for example, 1×2, 2×3, 3×2, 2×4, 4×2, etc. The orientation of transposed weight matrix 412, transposed input data matrix 414, and output data matrix 416 relative to mixed-signal MAC array 418 simplifies illustration; other orientations are also contemplated. Each mixed-signal MAC unit 440 calculates a dot product, between a row of converted weight matrix 212 and a column of converted input data matrix 214, to generate an element of converted output data matrix 216.

The digital processing circuitry includes digital-to-time converters (DTCs) 420, i.e., DTC 420 ¹, 420 ², 420 ³ and 420 ⁴, and digitally-controlled oscillators (DCOs) 430, i.e., DCO 430 ¹, 430 ², 430 ³ and 430 ⁴ to generate analog operand signals. Mixed-signal MAC array 418 includes four row signal lines 422, i.e., row signal lines 422 ¹, 422 ², 422 ³ and 422 ⁴, four column signal lines 432, i.e., column signal lines 432 ¹, 432 ², 432 ³ and 432 ⁴, four sign signal lines associated with column signal lines 432, and sixteen mixed-signal MAC units 440.M₁, . . . , 440.M₁₆, one disposed at each intersection of row signal lines 422 and column signal lines 432. In certain embodiments, the weights are unsigned (e.g., unipolar or bipolar weights), and the mixed-signal MAC units 440 do not include sign signal lines.

DTCs 420 are coupled to row signal lines 422, while DCOs 430 are coupled to column signal lines 432. More particularly, DTC 420 ¹ is coupled to row signal line 422 ¹, DTC 420 ² is coupled to row signal line 422 ², DTC 420 ³ is coupled to row signal line 422 ³, and DTC 420 ⁴ is coupled to row signal line 422 ⁴. DCO 430 ¹ is coupled to column signal line 432 ¹, DCO 430 ² is coupled to column signal line 432 ², DCO 430 ³ is coupled to column signal line 432 ³, and DCO 430 ⁴ is coupled to column signal line 432 ⁴.

Mixed-signal MAC units 440.M₁, M₂, M₃, and M₄ are coupled to row signal line 422 ¹ and receive analog signal A₁. Mixed-signal MAC units 440.M₅, M₆, M₇, and M₈ are coupled to row signal line 422 ² and receive analog signal A₂. Mixed-signal MAC units 440.M₉, M₁₀, M₁₁, and M₁₂ are coupled to row signal line 422 ³ and receive analog signal A₃. Mixed-signal MAC units 440.M₁₃, M₁₄, M₁₅, and M₁₆ are coupled to row signal line 422 ⁴ and receive analog signal A₄. Similarly, mixed-signal MAC units M₁, M₅, M₉, and M₁₃ are coupled to column signal line 432 ¹ and the associated sign signal line, and receive analog signal W₁ and signal SGN₁. Mixed-signal MAC units M₂, M₆, M₁₀, and M₁₄ are coupled to column signal line 432 ² and the associated sign signal line, and receive analog signal W₂ and signal SGN₂. Mixed-signal MAC units M₃, M₇, M₁₁, and M₁₅ are coupled to column signal line 432 ³ and the associated sign signal line, and receive analog signal W₃ and signal SGN₃. Mixed-signal MAC units M₄, M₈, M₁₂, and M₁₆ are coupled to column signal line 432 ⁴ and the associated sign signal line, and receive analog signal W₄ and signal SGN₄.

Generally, mixed-signal MAC unit 440 includes, inter alia, an integrated clock gate (ICG), a counter circuit and an output multiplexer. Each mixed-signal MAC unit 440 receives two analog operand signals and a sign signal, i.e., A_(i), W_(i) and SGN_(i), and multiplies signals W_(i) and A_(i) by incrementing or decrementing a count value based on the interaction of signals W_(i) and A_(i) determined by the ICG and the value of signal SGN_(i) (described in detail below). The count value is stored within a register in the counter circuit, and represents an intermediate value of the accumulated result until the dot product calculation is complete. The final count value is then output through the output multiplexer. This approach implements mixed-signal MAC operations in parallel. Each mixed-signal MAC unit 440 is reset by clearing or zeroing its counter circuit prior to, or at the start of, a new dot product calculation.

Column data from converted input data matrix 214 are sequentially provided as input data to DTCs 420, which convert the digital data to analog signals A₁, A₂, A₃ and A₄, which are output across mixed-signal MAC array 418 along row signal lines 422 ¹, 422 ², 422 ³ and 422 ⁴, respectively. Similarly, row data from converted weight matrix 212 are sequentially provided as weight data to DCOs 430, which convert the digital data to analog signals W₁, W₂, W₃ and W₄ and signals SGN₁, SGN₂, SGN₃ and SGN₄, which are output across mixed-signal MAC array 418 along column signal lines 432 ¹, 432 ², 432 ³ and 432 ⁴ and the four sign signal lines, respectively. Converted output data matrix 216 has four rows, each of which is reformed into a separate output feature map 206 ¹, 206 ², 206 ³ and 206 ⁴.

The dot product calculations performed by mixed-signal MAC units 440.M₁, 440.M₂, 440.M₃ and 440.M₄ for the first quadrant a_(q1) of converted input data matrix 214 are discussed in detail below, while the dot product calculations performed by the remaining mixed-signal MAC units 440 for the first quadrant a_(q1) are summarized below.

Mixed-signal MAC unit 440.M₁ calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the first column of converted input data matrix 214 to generate element o¹ ₁ of converted output data matrix 216. Mixed-signal MAC unit 440.M₂ calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the first column of converted input data matrix 214 to generate element o² ₁ of converted output data matrix 216. Mixed-signal MAC unit 440.M₃ calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the first column of converted input data matrix 214 to generate element o³ ₁ of converted output data matrix 216. Mixed-signal MAC unit 440.M₄ calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the first column of converted input data matrix 214 to generate element o⁴ ₁ of converted output data matrix 216.

During the first processing cycle, DTC 420 ¹ receives element a₁ from memory, converts the digital value into analog signal A₁, and transmits analog signal A₁ along row signal line 422 ¹. DCO 430 ¹ receives element w¹ ₁ from memory, converts the digital value into analog signal W₁, and transmits analog signal W₁ along column signal line 432 ¹. DCO 430 ¹ also determines the sign of element w¹ ₁, i.e., positive or negative, and then generates and transmits the SGN₁ signal along the first sign signal line. DCO 430 ² receives element w² ₁ from memory, converts the digital value into analog signal W₂, and transmits analog signal W₂ along column signal line 432 ². DCO 430 ² also determines the sign of element w² ₁, and then generates and transmits the SGN₂ signal along the second sign signal line. DCO 430 ³ receives element w³ ₁ from memory, converts the digital value into analog signal W₃, and transmits analog signal W₃ along column signal line 432 ³. DCO 430 ³ also determines the sign of element w³ ₁, and then generates and transmits the SGN₃ signal along the third sign signal line. DCO 430 ⁴ receives element w⁴ ₁ from memory, converts the digital value into analog signal W₄, and transmits analog signal W₄ along column signal line 432 ⁴. DCO 430 ⁴ also determines the sign of element w⁴ ₁, and then generates and transmits the SGN₄ signal along the fourth sign signal line.

Mixed-signal MAC unit 440.M₁ receives analog signal A₁ along row signal line 422 ¹, analog signal W₁ along column signal line 432 ¹, and the signal SGN₁ along the first sign signal line, and multiplies W₁ (w¹ ₁) and A₁ (a₁) by incrementing or decrementing the counter circuit based on the interaction of signals W₁ and A₁ determined by the ICG and the value of the signal SGN₁. In one embodiment, the counter circuit is incremented when SGN₁ has a value of 1, and decremented when SGN₁ has a value of 0; other configurations are also contemplated. The counter circuit begins the first processing cycle with a value of zero, and ends the first processing cycle with a value equal to w¹ ₁·a₁, which is the first intermediate value of the accumulated result for the dot product calculation.

Mixed-signal MAC unit 440.M₂ receives analog signal A₁ along row signal line 422 ¹, analog signal W₂ along column signal line 432 ², and the signal SGN₂ along the second sign signal line, and multiplies W₂ (w² ₁) and A₁ (a₁) by incrementing or decrementing the counter circuit based on the interaction of signals W₂ and A₁ determined by the ICG and the value of the signal SGN₂. The counter circuit begins the first processing cycle with a value of zero, and ends the first processing cycle with a value equal to w² ₁·a₁.

Mixed-signal MAC unit 440.M₃ receives analog signal A₁ along row signal line 422 ¹, analog signal W₃ along column signal line 432 ³, and the signal SGN₃ along the third sign signal line, and multiplies W₃ (w³ ₁) and A₁ (a₁) by incrementing or decrementing the counter circuit based on the interaction of signals W₃ and A₁ determined by the ICG and the value of the signal SGN₃. The counter circuit begins the first processing cycle with a value of zero, and ends the first processing cycle with a value equal to w³ ₁·a₁.

Mixed-signal MAC unit 440.M₄ receives analog signal A₁ along row signal line 422 ¹, analog signal W₄ along column signal line 432 ⁴, and the signal SGN₄ along the fourth sign signal line, and multiplies W₄ (w⁴ ₁) and A₁ (a₁) by incrementing or decrementing the counter circuit based on the interaction of signals W₄ and A₁ determined by the ICG and the value of the signal SGN₄. The counter circuit begins the first processing cycle with a value of zero, and ends the first processing cycle with a value equal to w⁴ ₁·a₁.

During the second processing cycle, DTC 420 ¹ receives element a₂ from memory, converts the digital value into analog signal A₁, and transmits analog signal A₁ along row signal line 422 ¹. DCO 430 ¹ receives element w¹ ₂ from memory, converts the digital value into analog signal W₁, and transmits analog signal W₁ along column signal line 432 ¹. DCO 430 ¹ also determines the sign of element w¹ ₂, and then generates and transmits the SGN₁ signal along the first sign signal line. DCO 430 ² receives element w² ₂ from memory, converts the digital value into analog signal W₂, and transmits analog signal W₂ along column signal line 432 ². DCO 430 ² also determines the sign of element w² ₂, and then generates and transmits the SGN₂ signal along the second sign signal line. DCO 430 ³ receives element w³ ₂ from memory, converts the digital value into analog signal W₃, and transmits analog signal W₃ along column signal line 432 ³. DCO 430 ³ also determines the sign of element w³ ₂, and then generates and transmits the SGN₃ signal along the third sign signal line. DCO 430 ⁴ receives element w⁴ ₂ from memory, converts the digital value into analog signal W₄, and transmits analog signal W₄ along column signal line 432 ⁴. DCO 430 ⁴ also determines the sign of element w⁴ ₂, and then generates and transmits the SGN₄ signal along the fourth sign signal line.

Mixed-signal MAC unit 440.M₁ receives analog signal A₁ along row signal line 422 ¹, analog signal W₁ along column signal line 432 ¹, and the signal SGN₁ along the first sign signal line, and multiplies W₁ (w¹ ₂) and A₁ (a₂) by incrementing or decrementing the counter circuit based on the interaction of signals W₁ and A₁ determined by the ICG and the value of the signal SGN₁. The counter circuit begins the second processing cycle with a value of w¹ ₁·a₁, and ends the second processing cycle with a value equal to w¹ ₂·a₂+w¹ ₁·a₁, which is the second intermediate value of the accumulated result for the dot product calculation.

Mixed-signal MAC unit 440.M₂ receives analog signal A₁ along row signal line 422 ¹, analog signal W₂ along column signal line 432 ², and the signal SGN₂ along the second sign signal line, and multiplies W₂ (w² ₂) and A₁ (a₂) by incrementing or decrementing the counter circuit based on the interaction of signals W₂ and A₁ determined by the ICG and the value of the signal SGN₂. The counter circuit begins the second processing cycle with a value of w² ₁·a₁, and ends the second processing cycle with a value equal to w² ₂·a₂+w² ₁·a₁.

Mixed-signal MAC unit 440.M₃ receives analog signal A₁ along row signal line 422 ¹, analog signal W₃ along column signal line 432 ³, and the signal SGN₃ along the third sign signal line, and multiplies W₃ (w³ ₂) and A₁ (a₂) by incrementing or decrementing the counter circuit based on the interaction of signals W₃ and A₁ determined by the ICG and the value of the signal SGN₃. The counter circuit begins the second processing cycle with a value of w³ ₁·a₁, and ends the second processing cycle with a value equal to w³ ₂·a₂+w³ ₁·a₁.

Mixed-signal MAC unit 440.M₄ receives analog signal A₁ along row signal line 422 ¹, analog signal W₄ along column signal line 432 ⁴, and the signal SGN₄ along the fourth sign signal line, and multiplies W₄ (w⁴ ₂) and A₁ (a₂) by incrementing or decrementing the counter circuit based on the interaction of signals W₄ and A₁ determined by the ICG and the value of the signal SGN₄. The counter circuit begins the second processing cycle with a value of w⁴ ₁·a₁, and ends the second processing cycle with a value equal to w⁴ ₂·a₂+w⁴ ₁·a₁.

During the third processing cycle, DTC 420 ¹ receives element a₆ from memory, converts the digital value into analog signal A₁, and transmits analog signal A₁ along row signal line 422 ¹. DCO 430 ¹ receives element w¹ ₃ from memory, converts the digital value into analog signal W₁, and transmits analog signal W₁ along column signal line 432 ¹. DCO 430 ¹ also determines the sign of element w¹ ₃, and then generates and transmits the SGN₁ signal along the first sign signal line. DCO 430 ² receives element w² ₃ from memory, converts the digital value into analog signal W₂, and transmits analog signal W₂ along column signal line 432 ². DCO 430 ² also determines the sign of element w² ₃, and then generates and transmits the SGN₂ signal along the second sign signal line. DCO 430 ³ receives element w³ ₃ from memory, converts the digital value into analog signal W₃, and transmits analog signal W₃ along column signal line 432 ³. DCO 430 ³ also determines the sign of element w³ ₃, and then generates and transmits the SGN₃ signal along the third sign signal line. DCO 430 ⁴ receives element w⁴ ₃ from memory, converts the digital value into analog signal W₄, and transmits analog signal W₄ along column signal line 432 ⁴. DCO 430 ⁴ also determines the sign of element w⁴ ₃, and then generates and transmits the SGN₄ signal along the fourth sign signal line.

Mixed-signal MAC unit 440.M₁ receives analog signal A₁ along row signal line 422 ¹, analog signal W₁ along column signal line 432 ¹, and the signal SGN₁ along the first sign signal line, and multiplies W₁ (w¹ ₃) and A₁ (a₆) by incrementing or decrementing the counter circuit based on the interaction of signals W₁ and A₁ determined by the ICG and the value of the signal SGN₁. The counter circuit begins the third processing cycle with a value of w¹ ₂·a₂+w¹ ₁·a₁, and ends the third processing cycle with a value equal to w¹ ₃·a₆+w¹ ₂·a₂+w¹ ₁·a₁, which is the third intermediate value of the accumulated result for the dot product calculation.

Mixed-signal MAC unit 440.M₂ receives analog signal A₁ along row signal line 422 ¹, analog signal W₂ along column signal line 432 ², and the signal SGN₂ along the second sign signal line, and multiplies W₂ (w² ₃) and A₁ (a₆) by incrementing or decrementing the counter circuit based on the interaction of signals W₂ and A₁ determined by the ICG and the value of the signal SGN₂. The counter circuit begins the third processing cycle with a value of w² ₂·a₂+w² ₁·a₁, and ends the third processing cycle with a value equal to w² ₃·a₆+w² ₂·a₂+w² ₁·a₁.

Mixed-signal MAC unit 440.M₃ receives analog signal A₁ along row signal line 422 ¹, analog signal W₃ along column signal line 432 ³, and the signal SGN₃ along the third sign signal line, and multiplies W₃ (w³ ₃) and A₁ (a₆) by incrementing or decrementing the counter circuit based on the interaction of signals W₃ and A₁ determined by the ICG and the value of the signal SGN₃. The counter circuit begins the third processing cycle with a value of w³ ₂·a₂+w³ ₁·a₁, and ends the third processing cycle with a value equal to w³ ₃·a₆+w³ ₂·a₂+w³ ₁·a₁.

Mixed-signal MAC unit 440.M₄ receives analog signal A₁ along row signal line 422 ¹, analog signal W₄ along column signal line 432 ⁴, and the signal SGN₄ along the fourth sign signal line, and multiplies W₄ (w⁴ ₃) and A₁ (a₆) by incrementing or decrementing the counter circuit based on the interaction of signals W₄ and A₁ determined by the ICG and the value of the signal SGN₄. The counter circuit begins the third processing cycle with a value of w⁴ ₂·a₂+w⁴ ₁·a₁, and ends the third processing cycle with a value equal to w⁴ ₃·a₆+w⁴ ₂·a₂+w⁴ ₁·a₁.

During the fourth processing cycle, DTC 420 ¹ receives element a₇ from memory, converts the digital value into analog signal A₁, and transmits analog signal A₁ along row signal line 422 ¹. DCO 430 ¹ receives element w¹ ₄ from memory, converts the digital value into analog signal W₁, and transmits analog signal W₁ along column signal line 432 ¹. DCO 430 ¹ also determines the sign of element w¹ ₄, and then generates and transmits the SGN₁ signal along the first sign signal line. DCO 430 ² receives element w² ₄ from memory, converts the digital value into analog signal W₂, and transmits analog signal W₂ along column signal line 432 ². DCO 430 ² also determines the sign of element w² ₄, and then generates and transmits the SGN₂ signal along the second sign signal line. DCO 430 ³ receives element w³ ₄ from memory, converts the digital value into analog signal W₃, and transmits analog signal W₃ along column signal line 432 ³. DCO 430 ³ also determines the sign of element w³ ₄, and then generates and transmits the SGN₃ signal along the third sign signal line. DCO 430 ⁴ receives element w⁴ ₄ from memory, converts the digital value into analog signal W₄, and transmits analog signal W₄ along column signal line 432 ⁴. DCO 430 ⁴ also determines the sign of element w⁴ ₄, and then generates and transmits the SGN₄ signal along the fourth sign signal line.

Mixed-signal MAC unit 440.M₁ receives analog signal A₁ along row signal line 422 ¹, analog signal W₁ along column signal line 432 ¹, and the signal SGN₁ along the first sign signal line, and multiplies W₁ (w¹ ₄) and A₁ (a₇) by incrementing or decrementing the counter circuit based on the interaction of signals W₁ and A₁ determined by the ICG and the value of the signal SGN₁. The counter circuit begins the fourth processing cycle with a value of w¹ ₃·a₆+w¹ ₂·a₂+w¹ ₁·a₁, and ends the fourth processing cycle with a value equal to w¹ ₄·a₇+w¹ ₃·a₆+w¹ ₂·a₂+w¹ ₁·a₁, which is the fourth intermediate value of the accumulated result for the dot product calculation. The counter circuit value is then output as the result of the dot product calculation for element o¹ ₁.

Mixed-signal MAC unit 440.M₂ receives analog signal A₁ along row signal line 422 ¹, analog signal W₂ along column signal line 432 ², and the signal SGN₂ along the second sign signal line, and multiplies W₂ (w² ₄) and A₁ (a₇) by incrementing or decrementing the counter circuit based on the interaction of signals W₂ and A₁ determined by the ICG and the value of the signal SGN₂. The counter circuit begins the fourth processing cycle with a value of w² ₃·a₆+w² ₂·a₂+w² ₁·a₁, and ends the fourth processing cycle with a value equal to w² ₄·a₇+w² ₃·a₆+w² ₂·a₂+w² ₁·a₁. The counter circuit value is then output as the result of the dot product calculation for element o² ₁.

Mixed-signal MAC unit 440.M₃ receives analog signal A₁ along row signal line 422 ¹, analog signal W₃ along column signal line 432 ³, and the signal SGN₃ along the third sign signal line, and multiplies W₃ (w³ ₄) and A₁ (a₇) by incrementing or decrementing the counter circuit based on the interaction of signals W₃ and A₁ determined by the ICG and the value of the signal SGN₃. The counter circuit begins the fourth processing cycle with a value of w³ ₃·a₆+w³ ₂·a₂+w³ ₁·a₁, and ends the fourth processing cycle with a value equal to w³ ₄·a₇+w³ ₃·a₆+w³ ₂·a₂+w³ ₁·a₁. The counter circuit value is then output as the result of the dot product calculation for element o³ ₁.

Mixed-signal MAC unit 440.M₄ receives analog signal A₁ along row signal line 422 ¹, analog signal W₄ along column signal line 432 ⁴, and the signal SGN₄ along the fourth sign signal line, and multiplies W₄ (w⁴ ₄) and A₁ (a₇) by incrementing or decrementing the counter circuit based on the interaction of signals W₄ and A₁ determined by the ICG and the value of the signal SGN₄. The counter circuit begins the fourth processing cycle with a value of w⁴ ₃·a₆+w⁴ ₂·a₂+w⁴ ₁·a₁, and ends the fourth processing cycle with a value equal to w⁴ ₄·a₇+w⁴ ₃·a₆+w⁴ ₂·a₂+w⁴ ₁·a₁. The counter circuit value is then output as the result of the dot product calculation for element o⁴ ₁.
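For illustration only, the four processing cycles described above for the first row of mixed-signal MAC array 418 may be summarized by the following Python sketch. The function name first_row_dot_products and the numeric operands are hypothetical, and the sketch models only the net change of each count value per processing cycle, not the individual increments or decrements.

```python
def first_row_dot_products(weights, activations):
    """Accumulate one dot product per MAC unit in the first array row.

    weights[j][k] stands for weight w^(j+1)_(k+1), sent by DCO 430^(j+1)
    during processing cycle k+1.
    activations[k] is the value sent by DTC 420^1 during processing cycle k+1.
    Returns the final count values and the count values after each cycle.
    """
    counters = [0] * len(weights)  # each counter circuit is reset to zero
    history = []
    for k, activation in enumerate(activations):  # one processing cycle
        for j, weight_set in enumerate(weights):  # units operate in parallel
            # Net effect of the increments (positive weight) or decrements
            # (negative weight) applied during this processing cycle.
            counters[j] += weight_set[k] * activation
        history.append(list(counters))  # intermediate accumulated values
    return counters, history


# Hypothetical integer operands; activations stand in for a1, a2, a6 and a7.
w = [[1, 2, 3, 4], [0, -1, 0, 1], [2, 2, 2, 2], [-1, -1, -1, -1]]
a = [1, 2, 3, 4]
final_counts, intermediate = first_row_dot_products(w, a)
```

Because each count value remains in its own mixed-signal MAC unit until the dot product is complete, the entries of the history list correspond to the first through fourth intermediate values described above.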

The second row of mixed-signal MAC array 418 includes mixed-signal MAC units 440.M₅, 440.M₆, 440.M₇ and 440.M₈.

Mixed-signal MAC unit 440.M₅ calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the second column of converted input data matrix 214 to generate element o¹ ₂ of converted output data matrix 216. Mixed-signal MAC unit 440.M₆ calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the second column of converted input data matrix 214 to generate element o² ₂ of converted output data matrix 216. Mixed-signal MAC unit 440.M₇ calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the second column of converted input data matrix 214 to generate element o³ ₂ of converted output data matrix 216. Mixed-signal MAC unit 440.M₈ calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the second column of converted input data matrix 214 to generate element o⁴ ₂ of converted output data matrix 216. These dot product calculations are performed in the same manner as those described above for the first row of mixed-signal MAC array 418, but use the second column of converted input data matrix 214 rather than the first column.

The third row of mixed-signal MAC array 418 includes mixed-signal MAC units 440.M₉, 440.M₁₀, 440.M₁₁ and 440.M₁₂.

Mixed-signal MAC unit 440.M₉ calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the third column of converted input data matrix 214 to generate element o¹ ₃ of converted output data matrix 216. Mixed-signal MAC unit 440.M₁₀ calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the third column of converted input data matrix 214 to generate element o² ₃ of converted output data matrix 216. Mixed-signal MAC unit 440.M₁₁ calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the third column of converted input data matrix 214 to generate element o³ ₃ of converted output data matrix 216. Mixed-signal MAC unit 440.M₁₂ calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the third column of converted input data matrix 214 to generate element o⁴ ₃ of converted output data matrix 216. These dot product calculations are performed in the same manner as those described above for the first row of mixed-signal MAC array 418, but use the third column of converted input data matrix 214 rather than the first column.

The fourth row of mixed-signal MAC array 418 includes mixed-signal MAC units 440.M₁₃, 440.M₁₄, 440.M₁₅ and 440.M₁₆.

Mixed-signal MAC unit 440.M₁₃ calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the fourth column of converted input data matrix 214 to generate element o¹ ₄ of converted output data matrix 216. Mixed-signal MAC unit 440.M₁₄ calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the fourth column of converted input data matrix 214 to generate element o² ₄ of converted output data matrix 216. Mixed-signal MAC unit 440.M₁₅ calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the fourth column of converted input data matrix 214 to generate element o³ ₄ of converted output data matrix 216. Mixed-signal MAC unit 440.M₁₆ calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the fourth column of converted input data matrix 214 to generate element o⁴ ₄ of converted output data matrix 216. These dot product calculations are performed in the same manner as those described above for the first row of mixed-signal MAC array 418, but use the fourth column of converted input data matrix 214 rather than the first column.

After the first quadrant a_(q1) of converted input data matrix 214 has been processed, the next sequence of operations may begin in order to process the second quadrant a_(q2) of converted input data matrix 214. After the second quadrant a_(q2) of converted input data matrix 214 has been processed, the next sequence of operations may begin in order to process the third quadrant a_(q3) of converted input data matrix 214. And, after the third quadrant a_(q3) of converted input data matrix 214 has been processed, the final sequence of operations may begin in order to process the fourth quadrant a_(q4) of converted input data matrix 214. Converted weight matrix 212 is accessed for each sequence of operations.
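By way of a non-limiting sketch, the quadrant-by-quadrant sequence described above may be expressed in Python as follows. The sketch assumes that each quadrant is a group of consecutive columns of the converted input data matrix equal in number to the rows of the array; the function name gemm_by_quadrants, its parameter names and the example values are hypothetical.

```python
def gemm_by_quadrants(weights, inputs, rows_in_array=4):
    """Compute the converted output data matrix one quadrant at a time.

    weights[j][k]: converted weight matrix (one weight set per row).
    inputs[k][c]: converted input data matrix (one column per output position).
    rows_in_array: number of input columns handled per sequence of operations
    (assumed here to equal the number of rows of the MAC array).
    """
    n_sets = len(weights)
    n_inner = len(inputs)
    n_cols = len(inputs[0])
    output = [[0] * n_cols for _ in range(n_sets)]
    for start in range(0, n_cols, rows_in_array):  # one quadrant per pass
        stop = min(start + rows_in_array, n_cols)
        for col in range(start, stop):  # one array row per input column
            for j in range(n_sets):  # one array column per weight set
                # The weight matrix is accessed again for every quadrant.
                output[j][col] = sum(
                    weights[j][k] * inputs[k][col] for k in range(n_inner)
                )
    return output


# Small hypothetical operands; a 2-row array processes two columns per pass.
identity = [[1, 0], [0, 1]]
data = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
result = gemm_by_quadrants(identity, data, rows_in_array=2)
```

The outer loop reflects the statement that converted weight matrix 212 is accessed for each sequence of operations, while the count values themselves are produced independently for every output element.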

An activation function and a bias may be applied to each element of the converted output data matrix to generate the elements of output feature maps 206, which are then provided as input feature maps 204 to the next layer. The activation function and bias may be applied after each element of the converted output data matrix is calculated, after all of the elements of the converted output data matrix are calculated, or by a subsequent locally-connected layer, such as an ReLU layer.

FIG. 5 depicts a block diagram 410 of mixed-signal MAC unit 440, in accordance with an embodiment of the present disclosure.

In one embodiment, mixed-signal MAC unit 440 includes ICG 442, counter circuit 443, and output multiplexer 448. ICG 442 includes an enable port (i.e., EN), a clock port (i.e., CLK) and an output port (i.e., OUT). The enable port is coupled to row signal line 422 ^(i), and the clock port is coupled to column signal line 432 ^(i). Counter circuit 443 has an input port (i.e., IN), a sign signal port (i.e., SGN) and an output port (i.e., OUT). The input port is coupled to the output port of ICG 442, the sign signal port is coupled to the sign signal line that conveys the signal SGN_(i), and the output port is coupled to output multiplexer 448. Counter circuit 443 includes register 444 and adder/subtractor circuit 446. Register 444 has a trigger port (i.e., TRG), an input port (i.e., IN) and an output port (i.e., OUT). The trigger port is coupled to the output port of ICG 442, and the output port is coupled to the output port of counter circuit 443. Adder/subtractor circuit 446 has a control port (i.e., C), an input port (i.e., IN) and an output port (i.e., OUT). The control port is coupled to the sign signal port of counter circuit 443, the input port is coupled to the output port of register 444, and the output port is coupled to the input port of register 444.
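For illustration only, the port connections described above may be mirrored by a simple behavioral model in Python. The class name MixedSignalMacUnit and its method names are hypothetical; the ICG is simplified here to a logical AND of the enable and clock levels, whereas an actual clock gate commonly also latches the enable signal to avoid glitches, and output multiplexer 448 is not modeled.

```python
class MixedSignalMacUnit:
    """Behavioral stand-in for one mixed-signal MAC unit (a sketch only)."""

    def __init__(self):
        self.register = 0     # count value held by the register
        self._prev_gated = 0  # previous level at the register trigger port

    def reset(self):
        """Clear the count value before a new dot product calculation."""
        self.register = 0
        self._prev_gated = 0

    def step(self, en, clk, sgn):
        """Apply one sample of the EN, CLK and SGN levels (each 0 or 1)."""
        gated = en & clk  # simplified ICG: clock passes only while enabled
        if gated == 1 and self._prev_gated == 0:  # rising edge at trigger
            # Adder/subtractor: SGN = 1 increments, SGN = 0 decrements.
            self.register += 1 if sgn == 1 else -1
        self._prev_gated = gated
        return self.register


unit = MixedSignalMacUnit()
clock = [0, 1, 0, 1, 0, 1]
enable = [1, 1, 1, 1, 0, 0]  # the enable pulse covers two clock periods
for en, clk in zip(enable, clock):
    count_after_pulse = unit.step(en, clk, sgn=1)
```

In this example only the two rising edges that fall within the enable pulse reach the register, so the third clock pulse leaves the count value unchanged.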

Mixed-signal MAC unit 440 receives two analog operand signals and the sign signal, i.e., signals A_(i), W_(i) and SGN_(i). In certain embodiments, in order to maintain analog operand signal quality throughout mixed-signal MAC array 418, analog signal A_(i) may be conditioned by buffers 441 ¹ and 441 ², and analog signal W_(i) may be conditioned by buffers 441 ³ and 441 ⁴. Buffers 441 ¹ and 441 ² are coupled to row signal line 422 ^(i), and buffers 441 ³ and 441 ⁴ are coupled to column signal line 432 ^(i).

During each processing cycle, DTC 420 ^(i) receives an activation element a_(k) from memory, encodes the digital value of a_(k) into a pulse-width modulated (PWM) signal A_(i), and transmits PWM signal A_(i) along row signal line 422 ^(i). Similarly, DCO 430 ^(i) receives a weight element w^(i) _(j) from memory, encodes the digital value of w^(i) _(j) into a frequency modulated (FM) signal W_(i), and transmits FM signal W_(i) along column signal line 432 ^(i). DCO 430 ^(i) also determines the sign of weight element w^(i) _(j), generates the signal SGN_(i), and transmits the signal SGN_(i) along the i^(th) sign signal line. When the sign of the weight element w^(i) _(j) is positive, the signal SGN_(i) is set to a first value (e.g., 1). Conversely, when the sign of the weight element w^(i) _(j) is negative, the signal SGN_(i) is set to a second value (e.g., 0). The signal SGN_(i) may be a digital signal or an analog signal, and other values indicating whether counter circuit 443 increments or decrements the count value are also contemplated.

ICG 442 receives FM signal W_(i) at the clock port, receives PWM signal A_(i) at the enable port, gates the FM signal W_(i) using the PWM signal A_(i), and generates a digital product signal that is provided to counter circuit 443. The digital product signal includes a number of clock signal rising edges, which are integrated or counted by counter circuit 443. Register 444 receives the digital product signal at the trigger port, and, for each clock signal rising edge, register 444 outputs the count value from the output port, receives an updated count value at the input port, and stores the updated count value. Adder/subtractor circuit 446 receives the signal SGN_(i) at the control port, and the count value at the input port. When the signal SGN_(i) has the first value (e.g., 1), adder/subtractor circuit 446 increments the count value to generate the updated count value. Conversely, when the signal SGN_(i) has the second value (e.g., 0), adder/subtractor circuit 446 decrements the count value to generate the updated count value. In one embodiment, adder/subtractor circuit 446 adds 1 to the count value to increment the count value, and subtracts 1 from the count value to decrement the count value. The updated count value is then output from the output port. In other words, the count value stored in register 444 will be incremented or decremented every time the clock signal rising edge occurs.
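The edge-by-edge behavior described above may be illustrated with a small behavioral model in Python. This is only a sketch under stated assumptions: the class and method names are hypothetical, the FM signal is reduced to a train of ideal rising edges, each rising edge is assumed to sit half a clock period after the start of the enable pulse, and the frequency and pulse-width values are taken from the examples in this description.

```python
from fractions import Fraction


class MixedSignalMacUnit:
    """Behavioral model of one MAC unit: an ICG feeding an up/down counter."""

    def __init__(self):
        self.count = 0  # count value held in the register

    @staticmethod
    def gated_edges(frequency_hz, pulse_width_s):
        """Yield the time of each clock rising edge that occurs while EN is high."""
        if frequency_hz == 0:
            return  # a stopped clock produces no edges
        k = 0
        while True:
            # Assumption: rising edges sit half a clock period after t = 0.
            t = (k + Fraction(1, 2)) / frequency_hz
            if t >= pulse_width_s:
                break
            yield t
            k += 1

    def process(self, frequency_hz, pulse_width_s, sgn):
        """Add 1 (sgn == 1) or subtract 1 (otherwise) for every gated rising edge."""
        for _ in self.gated_edges(frequency_hz, pulse_width_s):
            self.count += 1 if sgn == 1 else -1
        return self.count


unit = MixedSignalMacUnit()
after_first = unit.process(8160, Fraction(32, 255), sgn=1)   # positive weight
after_second = unit.process(16320, Fraction(16, 255), sgn=0)  # negative weight
print(after_first, after_second)  # 1024 0
```

Exact rational arithmetic is used here so that the edge times are not disturbed by floating-point rounding; a hardware implementation would of course be subject to jitter and calibration error that this model ignores.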

PWM signal A_(i) is a binary signal having a fixed signal period and a programmable pulse width or duty cycle, i.e., PWM signal A_(i) is “on” for a programmed amount of time during the PWM signal period and “off” during the remaining time of the PWM signal period. When PWM signal A_(i) is “on,” the voltage of PWM signal A_(i) is high, and when PWM signal A_(i) is “off,” the voltage of PWM signal A_(i) is low. Generally, the pulse width of PWM signal A_(i) may be programmed to represent a digital activation value by dividing the signal period into a number of discrete pulse widths based on the bit-width of the digital value, and programming the pulse width to the digital activation value. For example, for a signal period of 1 second, a digital activation value having a bit-width of 8 bits (unsigned) and a value of 32, the signal period may be divided into 255 pulse widths (i.e., 2⁸−1), and the pulse width may be programmed to 32/255=0.125490 seconds (12.5% duty cycle). The smallest pulse width is 1/255 seconds.
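The pulse-width encoding above can be sketched as follows. The function name and its parameters are hypothetical, and the 1-second period and 8-bit unsigned width are simply the example values from the text.

```python
from fractions import Fraction


def pwm_pulse_width(activation, bits=8, period_s=Fraction(1)):
    """Return the pulse width, in seconds, that encodes an unsigned activation value."""
    levels = 2**bits - 1  # 255 discrete pulse widths for 8 bits
    if not 0 <= activation <= levels:
        raise ValueError("activation does not fit in the given bit-width")
    return period_s * Fraction(activation, levels)


width = pwm_pulse_width(32)
smallest = pwm_pulse_width(1)
print(width, smallest)  # 32/255 1/255
```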

FM signal W_(i) is a sinusoidal signal having a programmable signal frequency. Generally, the signal frequency of FM signal W_(i) may be programmed to represent a digital weight value based on the smallest pulse width of PWM signal A_(i) and the bit-width of the digital weight value. For example, for a PWM signal A_(i) with a smallest pulse width of 1/255 seconds (equivalent to 255 Hz), a digital weight value having a bit-width of 8 bits (unsigned) and a value of 32, the programmable frequencies may be divided into 255 discrete frequencies (i.e., 2⁸−1), where each discrete frequency is a multiple of 255 Hz based on the smallest pulse width, and the frequency may be programmed to 32·255=8,160 Hz. The smallest frequency is 255 Hz.
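The frequency encoding above can be sketched in the same way. The function name and parameters are hypothetical; the base frequency is taken to be the reciprocal of the smallest PWM pulse width, as in the example.

```python
from fractions import Fraction


def fm_frequency(weight_magnitude, bits=8, smallest_pulse_width_s=Fraction(1, 255)):
    """Return the frequency, in Hz, that encodes an unsigned weight magnitude."""
    levels = 2**bits - 1  # 255 discrete frequencies for 8 bits
    if not 0 <= weight_magnitude <= levels:
        raise ValueError("weight magnitude does not fit in the given bit-width")
    base_hz = 1 / smallest_pulse_width_s  # 255 Hz for a 1/255 s pulse width
    return weight_magnitude * base_hz


frequency = fm_frequency(32)
print(frequency)  # 8160
```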

In this example, ICG 442 gates the FM signal W_(i) at 8,160 Hz using the PWM signal A_(i) with a duty cycle of 12.5%. This process generates (8,160·0.125490)=1,024 clock edges that are output to counter circuit 443, which integrates or counts the number of edges to arrive at 1,024, which is the product of the digital activation value of 32 and the digital weight value of 32.

In another example, a digital activation value of 16 produces a pulse width of 0.062745 seconds (6.27% duty cycle), and a digital weight value of 64 produces a programmable frequency of 16,320 Hz. The ICG gating process generates 16,320·0.062745=1,024 clock edges that are output to counter circuit 443, which integrates or counts the number of edges to arrive at 1,024, which is the product of the digital activation value of 16 and the digital weight value of 64.

In a further example, a digital activation value of 1 produces a pulse width of 0.0039215 seconds (0.39% duty cycle), and a digital weight value of 1 produces a programmable frequency of 255 Hz. The ICG gating process generates 255·0.0039215=1 clock edge that is output to counter circuit 443, which integrates or counts the number of edges to arrive at 1, which is the product of the digital activation value of 1 and the digital weight value of 1.

In a final example, a digital activation value of 127 produces a pulse width of 0.498039 seconds (49.8% duty cycle), and a digital weight value of 127 produces a programmable frequency of 32,385 Hz. The ICG gating process generates 32,385·0.498039=16,129 clock edges that are output to counter circuit 443, which integrates or counts the number of edges to arrive at 16,129, which is the product of the digital activation value of 127 and the digital weight value of 127.
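The arithmetic in the four examples above can be checked with a short Python sketch. The function name is hypothetical, and the calculation assumes the 1-second PWM signal period and 8-bit unsigned operands used in the examples. Exact fractions are used so that the rounded decimal pulse widths quoted in the text do not affect the result.

```python
from fractions import Fraction


def gated_edge_count(activation, weight, bits=8, period_s=Fraction(1)):
    """Number of clock edges passed by the gate in one PWM signal period."""
    levels = 2**bits - 1
    pulse_width_s = period_s * Fraction(activation, levels)  # PWM encoding
    frequency_hz = weight * levels / period_s                # FM encoding
    edges = frequency_hz * pulse_width_s
    assert edges.denominator == 1  # the product is always a whole number of edges
    return int(edges)


for activation, weight in [(32, 32), (16, 64), (1, 1), (127, 127)]:
    print(activation, weight, gated_edge_count(activation, weight))
# 32 32 1024
# 16 64 1024
# 1 1 1
# 127 127 16129
```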

Generally, ICG 442 and counter circuit 443 implement a MAC operation given by Equation 1:

ACC[N]=FW·TA+ACC[N−1]  Eq. 1

where ACC[N] is the current count value, ACC[N−1] is the previous count value, FW is the frequency of the FM signal W_(i), and TA is the pulse width of the PWM signal A_(i).

In other words, signals W_(i) and A_(i) are multiplied together by incrementing or decrementing the count value stored in register 444 based on the interaction of signals W_(i) and A_(i) determined by ICG 442 and the value of the signal SGN_(i). The count value stored within register 444 represents an intermediate value of the accumulated result until the dot product calculation is complete. After the processing sequence for the dot product calculation is complete, the final count values may be shifted out from each mixed-signal MAC unit 440. In certain embodiments, this process may be similar to shifting out the accumulator values from a systolic, output stationary MAC array. In other embodiments, each mixed-signal MAC unit 440 may be directly coupled to an output register, and all of the final count values may be output at the same time.
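The accumulation over a full dot product may be sketched as below, applying Equation 1 once per processing cycle with the sign signal selecting addition or subtraction. The function name, the 1-second PWM period, the 8-bit operand width and the sample vectors are assumptions for illustration. A zero weight is treated as positive here, which the text does not specify; it produces no edges in either case.

```python
from fractions import Fraction


def mixed_signal_dot(activations, weights, bits=8):
    """Accumulate signed products of unsigned activations and signed weights."""
    levels = 2**bits - 1
    acc = 0  # count value held in the register
    for a, w in zip(activations, weights):
        sgn = 1 if w >= 0 else 0               # sign signal: 1 = increment, 0 = decrement
        frequency_hz = abs(w) * levels         # FM encoding of the weight magnitude
        pulse_width_s = Fraction(a, levels)    # PWM encoding over a 1 s period
        edges = int(frequency_hz * pulse_width_s)  # FW * TA in Equation 1
        acc = acc + edges if sgn == 1 else acc - edges
    return acc


activations = [32, 16, 1]
weights = [32, -64, 1]
result = mixed_signal_dot(activations, weights)
print(result)  # 1
```

The result matches the ordinary dot product 32·32 − 16·64 + 1·1 = 1.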

Mixed-signal MAC units 440 provide many advantages over digital MAC units and analog NVM cells. Analog operand distribution does not require delay registers or flip-flops, and the number of wires per operand is reduced from 8 to 1. Most of the weights are zero or close to zero, so toggling on this wire should be low on average. The mixed-signal MAC operation is performed by an ICG and a simple counter circuit. For example, the digital MAC operation may be represented as a multiplication of two 8-bit operands followed by a 32-bit addition and storage of the result in a 32-bit accumulation register, i.e., 8·8+32→32, whereas the mixed-signal MAC operation may be represented as incrementing or decrementing a simple 32-bit counter by a particular amount, i.e., ±32.

Additionally, because the power consumption of the datapath is heavily proportional to the operand magnitudes and most weights are zero or small, non-zero values, mixed-signal MAC units 440 provide a big advantage over digital MAC units. Mixed-signal MAC units 440 do not require global clock distribution and pipelining latency for operands. And, while calibration may be required for DTCs 420 and DCOs 430 over process, voltage and temperature, i.e., PVT, compared to full analog voltage/current schemes, no ADCs are required, which advantageously reduces area and power consumption.

The embodiment of mixed-signal MAC unit 440 discussed above has output stationary dataflow, frequency modulated weight operands and pulse width modulated activations. Other embodiments may include weight stationary or input stationary dataflow, as well as different encodings of the analog operands, such as, for example, digital, voltage, current, frequency, pulse width and pulse/edge position. These encodings may be used in various combinations for the two analog operands. Additionally, further embodiments contemplate three-dimensional (or higher) mixed-signal MAC arrays.

FIG. 6 depicts a block diagram of system 100, in accordance with an embodiment of the present disclosure.

Computer 102 includes communication bus 110 coupled to one or more processors 120, memory 130, I/O interfaces 140, display interface 150, one or more communication interfaces 160 and one or more ANN accelerators 170. Generally, I/O interfaces 140 are coupled to I/O devices 142 using a wired or wireless connection, display interface 150 is coupled to display 152, and communication interface 160 is connected to network 162 using a wired or wireless connection.

Communication bus 110 is a communication system that transfers data between processor 120, memory 130, I/O interfaces 140, display interface 150, communication interface 160, ANN accelerator 170, as well as other components not depicted. Power connector 112 is coupled to communication bus 110 and a power supply (not shown).

Processor 120 includes one or more general-purpose or application-specific microprocessors that execute instructions to perform control, computation, input/output, etc. functions for computer 102. Processor 120 may include a single integrated circuit, such as a micro-processing device, or multiple integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of processor 120. In addition, processor 120 may execute computer programs or modules, such as operating system 132, software modules 134, etc., stored within memory 130. For example, software modules 134 may include an ML application, an ANN application, a CNN application, etc.

Generally, storage element or memory 130 stores instructions for execution by processor 120 and data. Memory 130 may include a variety of non-transitory computer-readable media that may be accessed by processor 120. In various embodiments, memory 130 may include volatile and nonvolatile media, non-removable media and/or removable media. For example, memory 130 may include any combination of random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), read only memory (ROM), flash memory, cache memory, and/or any other type of non-transitory computer-readable medium.

Memory 130 contains various components for retrieving, presenting, modifying, and storing data. For example, memory 130 stores software modules that provide functionality when executed by processor 120. The software modules include operating system 132 that provides operating system functionality for computer 102. Software modules 134 provide various functionality, such as image classification using convolutional neural networks, etc. Data 136 may include data associated with operating system 132, software modules 134, etc.

I/O interfaces 140 are configured to transmit and/or receive data from I/O devices 142. I/O interfaces 140 enable connectivity between processor 120 and I/O devices 142 by encoding data to be sent from processor 120 to I/O devices 142, and decoding data received from I/O devices 142 for processor 120. Generally, data may be sent over wired and/or wireless connections. For example, I/O interfaces 140 may include one or more wired communications interfaces, such as USB, Ethernet, etc., and/or one or more wireless communications interfaces, coupled to one or more antennas, such as WiFi, Bluetooth, cellular, etc.

Generally, I/O devices 142 provide input to computer 102 and/or output from computer 102. As discussed above, I/O devices 142 are operably connected to computer 102 using a wired and/or wireless connection. I/O devices 142 may include a local processor coupled to a communication interface that is configured to communicate with computer 102 using the wired and/or wireless connection. For example, I/O devices 142 may include a keyboard, mouse, touch pad, joystick, etc.

Display interface 150 is configured to transmit image data from computer 102 to monitor or display 152.

Communication interface 160 is configured to transmit data to and from network 162 using one or more wired and/or wireless connections. Network 162 may include one or more local area networks, wide area networks, the Internet, etc., which may execute various network protocols, such as, for example, wired and/or wireless Ethernet, Bluetooth, etc. Network 162 may also include various combinations of wired and/or wireless physical layers, such as, for example, copper wire or coaxial cable networks, fiber optic networks, Bluetooth wireless networks, WiFi wireless networks, CDMA, FDMA and TDMA cellular wireless networks, etc.

FIG. 7 depicts an ANN accelerator 170, in accordance with an embodiment of the present disclosure.

ANN accelerator 170 is configured to execute machine learning models, such as, for example, ANNs, CNNs, RNNs, etc., in support of various applications embodied by software modules 134. Generally, ANN accelerator 170 may include one or more processors, coprocessors, processing engines (PEs), compute engines (CEs), etc., such as, for example, CPUs, MCUs, GPUs, NPUs, such as, for example, the ARM Machine Learning (ML) Processor, DSPs, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), controllers, microcontrollers, matrix multiplier circuits, MAC arrays, etc. Generally, ANN accelerator 170 receives input data from memory 130 over communication bus 110, and transmits output data to memory 130 over communication bus 110.

ANN accelerator 170 also includes controller 172, communication bus interface 174, and one or more non-volatile and/or volatile memories 176, such as, for example, ROM, flash memory, SRAM, DRAM, etc. Controller 172 is coupled to communication bus interface 174, memory 176 and one or more PEs 180, and generally controls the functionality, data flow, etc., of ANN accelerator 170. Memory 176 is coupled to communication bus interface 174 and PEs 180, and stores, inter alia, ANN weights and activations. Each PE 180 includes one or more mixed-signal MAC arrays 418, and each mixed-signal MAC array 418 includes a number of mixed-signal MAC units 440, such as, for example, 4 mixed-signal MAC units 440, 8 mixed-signal MAC units 440, 16 mixed-signal MAC units 440 (e.g., 440.M₁ to 440.M₁₆, as depicted in FIG. 7), 32 mixed-signal MAC units 440, etc.

FIG. 8 depicts flow diagram 500 representing functionality associated with performing a mixed-signal MAC operation, in accordance with an embodiment of the present disclosure.

ICG 442 includes a clock port coupled to a first analog operand signal line, an enable port coupled to a second analog operand signal line, and an output port. The functionality at 510, 520, 530 and 540 is performed at ICG 442.

At 510, a first analog operand signal is received at the clock port.

At 520, a second analog operand signal is received at the enable port.

At 530, a digital product signal is generated based on the first analog operand signal and the second analog operand signal.

At 540, the digital product signal is output from the output port.

Counter circuit 443 includes an input port coupled to the ICG output port, register 444, adder/subtractor circuit 446, and an output port. The functionality at 550, 560 and 570 is performed at counter circuit 443.

At 550, the digital product signal is received at the input port.

At 560, a count value stored in register 444 is incremented or decremented based on the digital product signal.

At 570, the count value is output from the output port.

Embodiments of the present disclosure advantageously improve the distribution of operands in ANN accelerators by eliminating the need for flip-flops, ADCs and DACs.

The embodiments described herein are combinable.

In one embodiment, an artificial neural network (ANN) accelerator includes a plurality of digital controlled oscillators (DCOs), a plurality of digital-to-time converters (DTCs), and a mixed-signal MAC array, coupled to row signal lines and column signal lines, including a plurality of mixed-signal MAC units. Each DCO is configured to receive a first digital data value, generate a first analog operand signal based on the first digital data value, and transmit the first analog operand signal along a respective column signal line. Each DTC is configured to receive a second digital data value, generate a second analog operand signal based on the second digital data value, and transmit the second analog operand signal along a respective row signal line. Each mixed-signal MAC unit includes an integrated clock gate (ICG) and a counter circuit. The ICG includes a clock port coupled to a column signal line, an enable port coupled to a row signal line, and an output port, and the ICG is configured to receive, at the clock port, the first analog operand signal transmitted along the column signal line, receive, at the enable port, the second analog operand signal transmitted along the row signal line, generate a digital product signal based on the first analog operand signal and the second analog operand signal, and output, from the output port, the digital product signal. The counter circuit includes an input port coupled to the ICG output port, a register, an adder/subtractor circuit, and an output port, and the counter circuit is configured to receive, at the input port, the digital product signal, increment or decrement a count value stored in the register based on the digital product signal, and output, from the output port, the count value.

In another embodiment of the ANN accelerator, the first analog operand signal generated by each DCO is a frequency modulated (FM) signal; and the second analog operand signal generated by each DTC is a pulse-width modulated (PWM) signal.

In another embodiment of the ANN accelerator, generating a digital product signal includes gating the FM signal using the PWM signal.

In another embodiment of the ANN accelerator, the first digital data values are ANN weight values, and the second digital data values are ANN activation values.

In another embodiment of the ANN accelerator, the counter circuit includes a sign signal port, and the counter circuit is configured to receive, at the sign signal port, a sign signal having a first value or a second value, increment the count value of the register when the sign signal has the first value, and decrement the count value of the register when the sign signal has the second value.

In another embodiment of the ANN accelerator, each DCO is coupled to a respective sign signal line associated with the respective column signal line; each DCO is configured to generate a sign signal based on the first digital data value, the sign signal having the first value or the second value, and transmit the sign signal along the respective sign signal line; and, in each mixed-signal MAC unit, the counter circuit sign signal port is coupled to the sign signal line associated with the column signal line coupled to the ICG clock port.

In another embodiment of the ANN accelerator, the digital product signal includes a plurality of clock signal rising edges; the register includes a trigger port coupled to the counter circuit input port, an input port, and an output port coupled to the counter circuit output port; the adder/subtractor circuit includes a control port coupled to the counter circuit sign signal port, an input port coupled to the register output port, and an output port coupled to the register input port; the register is configured to receive, at the trigger port, the digital product signal, and, for each clock signal rising edge, output, from the output port, the count value, receive, at the input port, an updated count value, and store the updated count value; and the adder/subtractor circuit is configured to receive, at the control port, the sign signal, receive, at the input port, the count value, increment the count value to generate the updated count value when the sign signal has the first value, decrement the count value to generate the updated count value when the sign signal has the second value, and output, from the output port, the updated count value.

In another embodiment of the ANN accelerator, the adder/subtractor circuit is configured to add 1 to the count value to increment the count value, and subtract 1 from the count value to decrement the count value.

In one embodiment, a mixed-signal multiply-and-accumulate (MAC) unit includes an integrated clock gate (ICG) and a counter circuit. The ICG includes a clock port coupled to a first analog operand signal line, an enable port coupled to a second analog operand signal line, and an output port, and the ICG is configured to receive, at the clock port, a first analog operand signal, receive, at the enable port, a second analog operand signal, generate a digital product signal based on the first analog operand signal and the second analog operand signal, and output, from the output port, the digital product signal. The counter circuit includes an input port coupled to the ICG output port, a register, an adder/subtractor circuit, and an output port, and the counter circuit is configured to receive, at the input port, the digital product signal, increment or decrement a count value stored in the register based on the digital product signal, and output, from the output port, the count value.

In another embodiment of the mixed-signal MAC unit, the first analog operand signal is a frequency modulated (FM) signal, and the second analog operand signal is a pulse-width modulated (PWM) signal.

In another embodiment of the mixed-signal MAC unit, generating a digital product signal includes gating the FM signal using the PWM signal.

In another embodiment of the mixed-signal MAC unit, the FM signal is based on an artificial neural network (ANN) weight value, and the PWM signal is based on an ANN activation value.

In another embodiment of the mixed-signal MAC unit, the counter circuit includes a sign signal port, and the counter circuit is configured to receive, at the sign signal port, a sign signal associated with the FM signal, the sign signal having a first value or a second value; increment the count value of the register when the sign signal has the first value; and decrement the count value of the register when the sign signal has the second value.

In another embodiment of the mixed-signal MAC unit, the digital product signal includes a plurality of clock signal rising edges; the register includes a trigger port coupled to the counter circuit input port, an input port, and an output port coupled to the counter circuit output port; the adder/subtractor circuit includes a control port coupled to the counter circuit sign signal port, an input port coupled to the register output port, and an output port coupled to the register input port; the register is configured to receive, at the trigger port, the digital product signal, and, for each clock signal rising edge, output, from the output port, the count value, receive, at the input port, an updated count value, and store the updated count value; and the adder/subtractor circuit is configured to receive, at the control port, the sign signal, receive, at the input port, the count value, increment the count value when the sign signal has the first value to generate the updated count value, decrement the count value when the sign signal has the second value to generate the updated count value, and output, from the output port, the updated count value.

In another embodiment of the mixed-signal MAC unit, the adder/subtractor circuit is configured to add 1 to the count value to increment the count value, and subtract 1 from the count value to decrement the count value.

In one embodiment, a method for performing a mixed-signal multiply-and-accumulate (MAC) operation includes, at an integrated clock gate (ICG) including a clock port coupled to a first analog operand signal line, an enable port coupled to a second analog operand signal line, and an output port: receiving, at the clock port, a first analog operand signal, receiving, at the enable port, a second analog operand signal, generating a digital product signal based on the first analog operand signal and the second analog operand signal, and outputting, from the output port, the digital product signal; and, at a counter circuit including an input port coupled to the ICG output port, a register, an adder/subtractor circuit, and an output port: receiving, at the input port, the digital product signal, incrementing or decrementing a count value stored in the register based on the digital product signal, and outputting, from the output port, the count value.

In another embodiment of the method, the first analog operand signal is a frequency modulated (FM) signal, and the second analog operand signal is a pulse-width modulated (PWM) signal.

In another embodiment of the method, generating a digital product signal includes gating the FM signal using the PWM signal.

In another embodiment of the method, the method further includes, at a digital controlled oscillator (DCO), receiving a first digital data value, generating the FM signal based on the first digital data value, generating a sign signal based on the first digital data value, the sign signal having a first value or a second value, transmitting the FM signal along the first analog operand signal line, and transmitting the sign signal along a sign signal line associated with the first analog operand signal line; at a digital-to-time converter (DTC): receiving a second digital data value, generating the PWM signal based on the second digital data value, and transmitting the PWM signal along the second analog operand signal line; and at the counter circuit: receiving the sign signal, incrementing the count value of the register when the sign signal has the first value, and decrementing the count value of the register when the sign signal has the second value.

In another embodiment of the method, the first digital data value is an artificial neural network (ANN) weight value, and the second digital data value is an ANN activation value.

While implementations of the disclosure are susceptible to embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the disclosure and not intended to limit the disclosure to the specific embodiments shown and described. In the description above, like reference numerals may be used to describe the same, similar or corresponding parts in the several views of the drawings.

In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Reference throughout this document to “one embodiment,” “certain embodiments,” “an embodiment,” “implementation(s),” “aspect(s),” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive. Also, grammatical conjunctions are intended to express any and all disjunctive and conjunctive combinations of conjoined clauses, sentences, words, and the like, unless otherwise stated or clear from the context. Thus, the term “or” should generally be understood to mean “and/or” and so forth. References to items in the singular should be understood to include items in the plural, and vice versa, unless explicitly stated otherwise or clear from the text.

Recitation of ranges of values herein is not intended to be limiting, referring instead individually to any and all values falling within the range, unless otherwise indicated, and each separate value within such a range is incorporated into the specification as if it were individually recited herein. The words “about,” “approximately,” or the like, when accompanying a numerical value, are to be construed as indicating a deviation as would be appreciated by one of ordinary skill in the art to operate satisfactorily for an intended purpose. Ranges of values and/or numeric values are provided herein as examples only, and do not constitute a limitation on the scope of the described embodiments. The use of any and all examples, or exemplary language (“e.g.,” “such as,” “for example,” or the like) provided herein, is intended merely to better illuminate the embodiments and does not pose a limitation on the scope of the embodiments. No language in the specification should be construed as indicating any unclaimed element as essential to the practice of the embodiments.

For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The description is not to be considered as limited to the scope of the embodiments described herein.

In the following description, it is understood that terms such as “first,” “second,” “top,” “bottom,” “up,” “down,” “above,” “below,” and the like, are words of convenience and are not to be construed as limiting terms. Also, the terms apparatus, device, system, etc. may be used interchangeably in this text.

The many features and advantages of the disclosure are apparent from the detailed specification, and, thus, it is intended by the appended claims to cover all such features and advantages of the disclosure which fall within the scope of the disclosure. Further, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired to limit the disclosure to the exact construction and operation illustrated and described, and, accordingly, all suitable modifications and equivalents may be resorted to that fall within the scope of the disclosure. 

What is claimed is:
 1. An artificial neural network (ANN) accelerator, comprising: a plurality of digital controlled oscillators (DCOs), each DCO configured to receive a first digital data value, generate a first analog operand signal based on the first digital data value, and transmit the first analog operand signal along a respective column signal line; a plurality of digital-to-time converters (DTCs), each DTC configured to receive a second digital data value, generate a second analog operand signal based on the second digital data value, and transmit the second analog operand signal along a respective row signal line; and a mixed-signal MAC array, coupled to the row signal lines and the column signal lines, including a plurality of mixed-signal MAC units, each mixed-signal MAC unit including: an integrated clock gate (ICG), including a clock port, an enable port and an output port, the clock port coupled to a column signal line, the enable port coupled to a row signal line, the ICG configured to: receive, at the clock port, the first analog operand signal transmitted along the column signal line, receive, at the enable port, the second analog operand signal transmitted along the row signal line, generate a digital product signal based on the first analog operand signal and the second analog operand signal, and output, from the output port, the digital product signal, and a counter circuit, including an input port, a register, an adder/subtractor circuit and an output port, the input port coupled to the ICG output port, the counter circuit configured to: receive, at the input port, the digital product signal, increment or decrement a count value stored in the register based on the digital product signal, and output, from the output port, the count value.
 2. The ANN accelerator of claim 1, where: the first analog operand signal generated by each DCO is a frequency modulated (FM) signal; and the second analog operand signal generated by each DTC is a pulse-width modulated (PWM) signal.
 3. The ANN accelerator of claim 2, where said generate a digital product signal includes gating the FM signal using the PWM signal.
 4. The ANN accelerator of claim 3, where the first digital data values are ANN weight values, and the second digital data values are ANN activation values.
 5. The ANN accelerator of claim 3, where the counter circuit includes a sign signal port, and the counter circuit is configured to: receive, at the sign signal port, a sign signal having a first value or a second value; increment the count value of the register when the sign signal has the first value; and decrement the count value of the register when the sign signal has the second value.
 6. The ANN accelerator of claim 5, where: each DCO is coupled to a respective sign signal line associated with the respective column signal line; each DCO is configured to: generate a sign signal based on the first digital data value, the sign signal having the first value or the second value, and transmit the sign signal along the respective sign signal line; and in each mixed-signal MAC unit, the counter circuit sign signal port is coupled to the sign signal line associated with the column signal line coupled to the ICG clock port.
 7. The ANN accelerator of claim 6, where: the digital product signal includes a plurality of clock signal rising edges; the register includes a trigger port coupled to the counter circuit input port, an input port, and an output port coupled to the counter circuit output port; the adder/subtractor circuit includes a control port coupled to the counter circuit sign signal port, an input port coupled to the register output port, and an output port coupled to the register input port; the register is configured to: receive, at the trigger port, the digital product signal, and for each clock signal rising edge: output, from the output port, the count value, receive, at the input port, an updated count value, and store the updated count value; and the adder/subtractor circuit is configured to: receive, at the control port, the sign signal, receive, at the input port, the count value, when the sign signal has the first value, increment the count value to generate the updated count value, when the sign signal has the second value, decrement the count value to generate the updated count value, and output, from the output port, the updated count value.
 8. The ANN accelerator of claim 7, where the adder/subtractor circuit is configured to add 1 to the count value to increment the count value, and subtract 1 from the count value to decrement the count value.
 9. A mixed-signal multiply-and-accumulate (MAC) unit, comprising: an integrated clock gate (ICG), including a clock port, an enable port and an output port, the clock port coupled to a first analog operand signal line, the enable port coupled to a second analog operand signal line, the ICG configured to: receive, at the clock port, a first analog operand signal, receive, at the enable port, a second analog operand signal, generate a digital product signal based on the first analog operand signal and the second analog operand signal, and output, from the output port, the digital product signal; and a counter circuit, including an input port, a register, an adder/subtractor circuit and an output port, the input port coupled to the ICG output port, the counter circuit configured to: receive, at the input port, the digital product signal, increment or decrement a count value stored in the register based on the digital product signal, and output, from the output port, the count value.
 10. The mixed-signal MAC unit of claim 9, where: the first analog operand signal is a frequency modulated (FM) signal; and the second analog operand signal is a pulse-width modulated (PWM) signal.
 11. The mixed-signal MAC unit of claim 10, where said generate a digital product signal includes gating the FM signal using the PWM signal.
 12. The mixed-signal MAC unit of claim 11, where the FM signal is based on an artificial neural network (ANN) weight value, and the PWM signal is based on an ANN activation value.
 13. The mixed-signal MAC unit of claim 11, where the counter circuit includes a sign signal port, and the counter circuit is configured to: receive, at the sign signal port, a sign signal associated with the FM signal, the sign signal having a first value or a second value; increment the count value of the register when the sign signal has the first value; and decrement the count value of the register when the sign signal has the second value.
 14. The mixed-signal MAC unit of claim 13, where: the digital product signal includes a plurality of clock signal rising edges; the register includes a trigger port coupled to the counter circuit input port, an input port, and an output port coupled to the counter circuit output port; the adder/subtractor circuit includes a control port coupled to the counter circuit sign signal port, an input port coupled to the register output port, and an output port coupled to the register input port; the register is configured to: receive, at the trigger port, the digital product signal, and for each clock signal rising edge: output, from the output port, the count value, receive, at the input port, an updated count value, and store the updated count value; and the adder/subtractor circuit is configured to: receive, at the control port, the sign signal, receive, at the input port, the count value, when the sign signal has the first value, increment the count value to generate the updated count value, when the sign signal has the second value, decrement the count value to generate the updated count value, and output, from the output port, the updated count value.
 15. The mixed-signal MAC unit of claim 14, where the adder/subtractor circuit is configured to add 1 to the count value to increment the count value, and subtract 1 from the count value to decrement the count value.
 16. A method for performing a mixed-signal multiply-and-accumulate (MAC) operation, comprising: at an integrated clock gate (ICG) including a clock port, an enable port and an output port, the clock port coupled to a first analog operand signal line, the enable port coupled to a second analog operand signal line: receiving, at the clock port, a first analog operand signal, receiving, at the enable port, a second analog operand signal, generating a digital product signal based on the first analog operand signal and the second analog operand signal, and outputting, from the output port, the digital product signal; and at a counter circuit including an input port coupled to the ICG output port, a register, an adder/subtractor circuit and an output port: receiving, at the input port, the digital product signal, incrementing or decrementing a count value stored in the register based on the digital product signal, and outputting, from the output port, the count value.
 17. The method of claim 16, where: the first analog operand signal is a frequency modulated (FM) signal; and the second analog operand signal is a pulse-width modulated (PWM) signal.
 18. The method of claim 17, where said generating a digital product signal includes gating the FM signal using the PWM signal.
 19. The method of claim 18, further comprising: at a digital controlled oscillator (DCO): receiving a first digital data value, generating the FM signal based on the first digital data value, generating a sign signal based on the first digital data value, the sign signal having a first value or a second value, transmitting the FM signal along the first analog operand signal line, and transmitting the sign signal along a sign signal line associated with the first analog operand signal line; at a digital-to-time converter (DTC): receiving a second digital data value, generating the PWM signal based on the second digital data value, and transmitting the PWM signal along the second analog operand signal line; and at the counter circuit: receiving the sign signal, incrementing the count value of the register when the sign signal has the first value, and decrementing the count value of the register when the sign signal has the second value.
 20. The method of claim 19, where the first digital data value is an artificial neural network (ANN) weight value, and the second digital data value is an ANN activation value. 
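
By way of illustration only, the method recited in claims 16 to 20 may be approximated by the following behavioral sketch in Python. It is not part of the claimed subject matter and does not describe any particular circuit implementation. The function and class names (dco, dtc, icg_edges, Counter, mac) are hypothetical. The sketch assumes that the FM signal frequency is proportional to the magnitude of a signed integer weight, that the PWM pulse width is proportional to a non-negative integer activation, and that the time units are chosen so that the number of gated rising edges equals the product of the two magnitudes. Exact fractions are used so that edge counting is free of rounding error.

```python
from fractions import Fraction


def dco(weight):
    """Hypothetical DCO model: map a signed integer weight to (edge period, sign).

    Assumption: frequency is proportional to |weight|, so period = 1 / |weight|.
    A zero weight produces no oscillation (period is None).
    """
    sign = -1 if weight < 0 else 1
    if weight == 0:
        return None, sign
    return Fraction(1, abs(weight)), sign


def dtc(activation):
    """Hypothetical DTC model: pulse width proportional to a non-negative activation."""
    if activation < 0:
        raise ValueError("this sketch assumes non-negative activations")
    return Fraction(activation)


def icg_edges(period, width):
    """Gate the FM clock with the PWM enable; yield each rising edge that passes."""
    if period is None:
        return
    t = Fraction(0)
    while t < width:  # the enable is high only during the pulse
        yield t
        t += period


class Counter:
    """Register plus adder/subtractor: add or subtract 1 per rising edge."""

    def __init__(self):
        self.count = 0

    def clock(self, sign):
        self.count += 1 if sign > 0 else -1


def mac(weights, activations):
    """Accumulate sum(w * a) by counting gated clock edges, one product at a time."""
    counter = Counter()
    for w, a in zip(weights, activations):
        period, sign = dco(w)
        width = dtc(a)
        for _ in icg_edges(period, width):
            counter.clock(sign)
    return counter.count


if __name__ == "__main__":
    # 3*4 + (-2)*5 + 0*7 + 1*0 = 2
    print(mac([3, -2, 0, 1], [4, 5, 7, 0]))
```

In this sketch, a weight of 3 and an activation of 4 pass twelve rising edges through the gate, so the count value increases by 12; a weight of -2 and an activation of 5 pass ten edges with the sign signal at its second value, so the count value decreases by 10.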